12. Multithreaded Processors Dezső Sima Fall 2006  D. Sima, 2006.

Overview
1 Introduction
2 Overview
3 Coarse grain multithreading
4 Fine grain multithreading
5 Simultaneous multithreading

1. Introduction (1): Threads
Thread: flow of control.
Main features of multithreading: threads belong to the same process; share a common address space (usually; otherwise multiple virtual-to-real address translation paths need to be maintained in parallel); are executed simultaneously (overlapped or in parallel).
Thread management: creation, control and termination of threads; maintaining multiple sets of thread states; context switching between threads.
Aim of multithreading: to raise performance compared to superscalar execution or multitasking by increased parallelism at execution.

1. Introduction (2)
Implementation of multithreading (while executing multithreaded apps/OSs):
Software implementation (multithreaded OSs): multiple threads are maintained concurrently by the OS; multithreaded apps/OSs are executed on a single threaded processor by time sharing. Fast context switching between threads is required.
Hardware implementation (multithreaded processors): multiple threads are maintained concurrently by the processor; multithreaded apps/OSs are executed concurrently on a multithreaded processor.

1. Introduction (3)
Basic options to implement multithreaded processors: multicore processors and multithreaded cores.
[Figure: a multicore chip (cores, L2/L3, L3/memory) next to a multithreaded core (MT core, L2/L3, L3/memory)]
(SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing)

1. Introduction (4)
Requirement of software multithreading: maintaining multiple thread states concurrently by the OS, including: PC, FX/FP registers, state registers.
Core enhancements needed in case of multithreaded cores:
Maintaining multiple thread states, including: PC, architectural registers, state registers (in case of merged architectural and rename registers, by providing appropriately large file sizes (FX/FP)).
Maintaining multiple thread microstates, pertaining to: rename mappings, the RAS (Return Address Stack), ROB, etc.
Providing increased sizes for scarce or sensitive resources, such as: the instruction buffer, store queue, etc.

1. Introduction (5)
Multicore processors: additional complexity ~ (60-80) %, additional gain (in gen. purp. apps) ~ (60-80) %.
Multithreaded cores: additional complexity ~ (2-10) %, additional gain (in gen. purp. apps) ~ (0-30) %.

1. Introduction (6)
Multithreaded OSs: Windows NT, OS/2, Unix w/Posix, most OSs developed from the 90’s on.

Principle of sequential, multitask and multithreaded programming
[Figure: process/thread management example. Sequential programming: processes P1, P2, P3. Multitask programming: processes P1, P2, P3 with fork(), exec() and join(). Multithreaded programming: threads T1 to T6 of a process, with CreateProcess() and CreateThread().]
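A minimal sketch of the multithreaded case, written in Python rather than with the Win32 calls named in the figure. It only illustrates thread creation, termination (join) and the shared address space; the function name run_threads and its parameters are hypothetical.

```python
import threading


def run_threads(num_threads=4, increments=1000):
    """Create threads that all update one shared counter, then join them."""
    shared = {"counter": 0}      # one object, visible to every thread of the process
    lock = threading.Lock()      # serialises updates to the shared counter

    def worker():
        for _ in range(increments):
            with lock:
                shared["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:            # thread creation
        t.start()
    for t in threads:            # join: wait for each thread to terminate
        t.join()
    return shared["counter"]


if __name__ == "__main__":
    print(run_threads())
```

All threads see the same counter because they belong to one process; separate processes created with fork() would each get a private copy.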

Execution of sequential, multitask and multithreaded programs

Sequential programs
Description: single process on a single processor.
Key advantages: no issues with parallel programs.
Key issues: sequential bottleneck.

Multitask programs (software implementation)
Description: multiple processes on a single processor using time sharing.
Key advantages: multiple programs with quasi-parallel execution; private address spaces.
Key issues: solutions for fast context switching.

Multithreaded programs (software implementation)
Description: multithreaded software on a single threaded processor by time sharing.
Key advantages: multiple programs with quasi-parallel execution; shared process address spaces; faster intra-process context switches.
Key issues: thread state management and context switching.

Multithreaded programs (hardware implementation)
Description: multithreaded software on a multithreaded processor.
Key advantages: true parallel execution; shared process address spaces; near linear speedup; fastest intra-process context switches.
Key issues: thread state management; thread scheduling.

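Implementation of multiprocessing and multithreading (2)

Sequential programs
OS support: legacy OS support.
Software development: no API level support.
Performance level: low.

Multitask programs (software implementation)
OS support: traditional Unix.
Software development: process life cycle management API.
Performance level: low-medium.

Multithreaded programs (software implementation)
OS support: most modern OSs (Windows NT/2000, OS/2, Unix+Posix).
Software development: process and thread life cycle management API; explicit threading API; OpenMP.
Performance level: high.

Multithreaded programs (hardware implementation)
OS support: most modern OSs (Windows NT/2000, OS/2, Unix+Posix).
Software development: process and thread life cycle management API; explicit threading API; OpenMP.
Performance level: higher.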

2. Overview
2.1 Thread scheduling
Thread scheduling while implementing software multithreading on a traditional superscalar processor: the execution of a new thread is initiated by a context switch (needed to save the state of the suspended thread and to load the state of the thread to be executed).
Figure 2.1: Thread scheduling in a traditional superscalar processor
Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading

Thread scheduling in CMPs: cores execute different threads independently.
Figure 2.2: Thread scheduling in a CMP
Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading

2. Overview
Thread scheduling in multithreaded cores: coarse grain MT

Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading Figure 2.3: Thread scheduling in a 4-way coarse grained multithreaded processor Threads are switched by means of rapid, HW-supported context switches.

2. Overview
Thread scheduling in multithreaded cores: coarse grain MT, fine grain MT

The hardware thread scheduler chooses a thread in each cycle, and instructions from this thread are dispatched/issued in this cycle.
Figure 2.4: Thread scheduling in a 4-way fine grained multithreaded processor
Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading

2. Overview
Thread scheduling in multithreaded cores: coarse grain MT, fine grain MT, simultaneous MT (SMT)

Available instructions (chosen according to an appropriate selection policy, such as the priority of the threads) are dispatched/issued for execution in each cycle.
SMT: proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington).
Figure 2.5: Thread scheduling in a 4-way simultaneous multithreaded processor
Source: Mazzucco P., „Fundamentals of Multithreading,” http://www.slcentral.com/articles/01/6/multithreading
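The difference from fine grain MT is that one cycle can carry instructions from several threads. A minimal sketch of filling the issue slots of a single cycle, assuming a simple priority-ordered selection policy; the function smt_issue and its arguments are hypothetical and do not model any particular processor.

```python
def smt_issue(ready, issue_width, priority):
    """Fill the issue slots of one cycle from several threads.

    ready:    dict mapping thread id -> number of instructions ready this cycle
    priority: thread ids, highest priority first (the assumed selection policy)
    Returns a list with one thread id per filled issue slot.
    """
    slots = []
    for tid in priority:
        # Take as many ready instructions as still fit into this cycle.
        take = min(ready.get(tid, 0), issue_width - len(slots))
        slots.extend([tid] * take)
        if len(slots) == issue_width:
            break
    return slots


if __name__ == "__main__":
    print(smt_issue({0: 1, 1: 2, 2: 0, 3: 3}, 4, [0, 1, 2, 3]))
```

With a single thread that has only one ready instruction, three of the four slots stay empty; with several threads the same cycle can be filled.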

2.2 Overview of multithreaded cores (1)
RISC superscalars:
IBM: RS64 IV (Sstar) (2000), single core multithreaded, 2T, 0.18 µ / 44 mtrs; POWER5 (2004), dual core multithreaded, 2T, 0.13 µ / 276 mtrs.
DEC/Compaq: Alpha 21464 (V8) (2003), single core multithreaded, 4T, 0.13 µ / 250 mtrs.
Sun: UltraSPARC T1 (Niagara) (2005), multi core multithreaded, 8 cores/4T, 0.09 µ / 279 mtrs.
Figure 2.6: Multithreaded cores (1)

2.2 Overview of multithreaded cores (2)
Intel CISC superscalars:
Pentium 4 (Northwood) (2002), single core multithreaded, 0.13 µ / 55 mtrs.
Pentium EE 840 (4/2005), dual core multithreaded, 0.09 µ / 230 mtrs.
Pentium EE 955/965 (Presler) (4/2005), dual core multithreaded, 0.065 µ / 2*188 mtrs.
Intel VLIWs:
Montecito (2006?), 2*Itanium 2 (Madison), dual core multithreaded, 0.09 µ / 1730 mtrs.
Figure 2.7: Multithreaded cores (2)

2.2 Overview of multithreaded cores (3)
Classification by underlying core(s):
Scalar core(s): SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores, 4 threads.
Superscalar core(s): IBM RS64 IV (2000) (SStar), 2-way; Pentium 4 (2002), 2-way; DEC 21464 (2003), 4-way; IBM POWER5 (2005), dual-core/2-way; Pentium EE 840 (2005), dual-core/2-way; Pentium EE 955/965 (2005), dual-core/2-way.
VLIW core(s): SUN MAJC 5200 (2000), quad-core/4-way (dedicated use); Intel Montecito (2006?), dual-core/2-way.

3. Coarse grain multithreading
3.1 Overview (1)
Thread scheduling in multithreaded cores: coarse grain MT, fine grain MT, simultaneous MT (SMT)

3. Coarse grain multithreading
3.1 Overview (2)
Coarse grain MT:
Scalar based: (no example listed).
Superscalar based: IBM RS64 IV (2000) (SStar), 2T.
VLIW based: SUN MAJC 5200 (2000), quad-core/4T (dedicated use); Intel Montecito (2006?), dual-core/2T.

3.2 Case example 1: IBM RS 64 IV (1)
Optimized for commercial server workloads, such as on-line transaction processing, Web serving, ERP (Enterprise Resource Planning).
Used in IBM’s iSeries and pSeries commercial servers.
Microarchitecture: 4-way superscalar, dual-threaded. Instruction fetch width: 8 instr./cycle. Both single threaded and multithreaded modes of execution.
Duplicated resources (~ +5 % chip area):
Architectural state: GPRs, FPRs, CR (condition reg.), CTR (count reg.), special purpose privileged mode registers, such as the MSR (machine state reg.), status and control registers, such as T priority.
Each T executes in its own effective address space. Units used for address translation need to be duplicated, such as the SRs (Segment Address Reg.s).

3.2 Case example 1: IBM RS 64 IV (2) Figure 3.1: Microarchitecture of IBM’s RS 64 IV Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”, IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898 IERAT: Effective to real address translation cache (2x64 entries) 6XX bus

3.2 Case example 1: IBM RS 64 IV (3)
Aim: commercial workloads, which have large working sets and frequently occurring task switches, resulting in high cache miss rates and a need for large L1$s.
Thread switching (strongly simplified): two Ts are implemented, a foreground T and a background T. The foreground T executes until a long latency event occurs, such as a cache miss or an IERAT miss. Subsequently, a T switch is performed and the background T begins to execute. After the miss is serviced, a T switch back to the foreground T occurs.
The Thread Switch Buffer holds up to 8 instructions from the background T, to eliminate the latency of the I$.
Threads can be allocated different priorities by explicit instructions.
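The simplified switching rule above can be sketched as a tiny cycle-by-cycle model. This is an illustration only: the instruction encoding ('i' for an ordinary instruction, 'm' for one that misses), the zero switch cost and the function name coarse_grain_run are assumptions, and the Thread Switch Buffer is not modelled.

```python
def coarse_grain_run(fg, bg, miss_latency):
    """Return a per-cycle trace: 'F' (foreground), 'B' (background) or '-' (idle).

    fg, bg: strings of 'i' (ordinary instruction) and 'm' (instruction that misses).
    A miss blocks its thread for miss_latency cycles after the cycle it issues in.
    The thread switch itself is assumed to cost no cycles.
    """
    prog = {"F": fg, "B": bg}
    pc = {"F": 0, "B": 0}
    ready_at = {"F": 0, "B": 0}   # first cycle in which the thread may run again
    trace = []
    cycle = 0
    while pc["F"] < len(fg) or pc["B"] < len(bg):
        # The foreground thread runs whenever it can; otherwise try the background.
        chosen = None
        for t in ("F", "B"):
            if pc[t] < len(prog[t]) and ready_at[t] <= cycle:
                chosen = t
                break
        if chosen is None:
            trace.append("-")     # both threads are waiting on a miss
        else:
            if prog[chosen][pc[chosen]] == "m":
                ready_at[chosen] = cycle + 1 + miss_latency
            pc[chosen] += 1
            trace.append(chosen)
        cycle += 1
    return "".join(trace)


if __name__ == "__main__":
    print(coarse_grain_run("imi", "iiii", 3))   # two threads
    print(coarse_grain_run("imi", "", 3))       # foreground thread alone
```

In this model the background thread fills the cycles that the foreground thread would otherwise spend waiting for its miss.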

3.2 Case example 1: IBM RS 64 IV (4) Figure 3.2: Thread switch on data cache miss in IBM’s RS 64 IV Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”, IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898

3.2 Case example 2: SUN MAJC 5200 (1)
Aim: dedicated use, high-end graphics, networking with wire-speed computational demands.
Microarchitecture: up to 4 processors on a die; each processor has 4 FUs (Functional Units), 3 of them identical, one enhanced; each FU has its private logic and register set (e.g. 32 or 64 regs.); the 4 FUs of a processor share a set of global regs. (e.g. 64 regs.); all registers are unified (not split into FX/FP files); any FU can process any data type.
Each processor is a 4-wide VLIW and can be 4-way multithreaded.

3.2 Case example 2: SUN MAJC 5200 (2) Figure 3.3: General view of SUN’s MAJC 5200 Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc

3.2 Case example 2: SUN MAJC 5200 (3) Figure 3.4: The principle of private, unified register files associated with each FU Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc

3.2 Case example 2: SUN MAJC 5200 (4)
Threading: each processor with its 4 FUs can be operated in a 4-way multithreaded mode (called Vertical Multithreading by Sun).
Implementation of 4-way multithreading: by executing each T by one of the 4 FUs („Vertical multithreading”).
Thread switch: following a cache miss, the processor saves the T state and begins to process the next T.
Example: comparison of program execution without and with multithreading on a 4-wide VLIW.
Considered program: it consists of 100 instructions, executes 2.5 instrs./cycle on average, and incurs a cache miss after each 20 instructions. Latency of serving a cache miss: 75 cycles.
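The numbers of the example can be worked through with a simple stall model. This is a back-of-the-envelope sketch, not the calculation behind Sun's figures: it assumes that every 20th instruction (including the last one) misses, that a miss stalls a single-threaded processor for the full latency, that all threads behave identically and that a thread switch is free. The function names are hypothetical.

```python
def single_thread_cycles(instructions, ipc, miss_interval, miss_latency):
    """Cycles needed by one thread when every miss stalls the processor."""
    compute = instructions / ipc                # cycles spent executing
    misses = instructions // miss_interval      # one miss per miss_interval instrs.
    return compute + misses * miss_latency


def utilization(threads, ipc, miss_interval, miss_latency):
    """Fraction of cycles doing useful work with switch-on-miss threading."""
    run = miss_interval / ipc                   # cycles of work between two misses
    return min(1.0, threads * run / (run + miss_latency))


if __name__ == "__main__":
    print(single_thread_cycles(100, 2.5, 20, 75))   # 40 compute + 5 * 75 stall cycles
    print(utilization(1, 2.5, 20, 75))              # 8 / 83
    print(utilization(4, 2.5, 20, 75))              # 32 / 83
```

Under these assumptions the single thread needs 415 cycles for 40 cycles of work, and four threads raise the utilization from 8/83 to 32/83, since each thread contributes 8 busy cycles per 83-cycle miss period.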

3.2 Case example 2: SUN MAJC 5200 (5) Figure 3.5: Execution for subsequent cache misses in a single threaded processor Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc

3.2 Case example 2: SUN MAJC 5200 (6) Figure 3.6: Execution for subsequent cache misses in SUN’s MAJC 5200 Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc

3.2 Case example 3: Intel Montecito (1)
Aim: high end servers.
Main differences between Itanium 2 and Montecito: split L2 caches, larger unified L3 cache, duplicated architectural states maintained.
Additional support of dual-threading: the branch prediction structures provide T tagging; per thread return stack structures; per thread ALATs (Advanced Load Address Tables).
Additional core area needed: ~ 2 %.

3.2 Case example 3: Intel Montecito (2) Figure 3.7: Microarchitecture of Intel’s Itanium 2 Source: McNairy, C., „Itanium 2”, IEEE Micro, March/April 2003, Vol. 23, No. 2, pp. 44-55

3.2 Case example 3: Intel Montecito (3) Figure 3.8: Microarchitecture of Intel’s Montecito (ALAT: Advanced Load Address Table) Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20

3.2 Case example 3: Intel Montecito (4)
Thread switches: 5 event types cause thread switches, such as L3 cache misses and programmed switch hints. Total switch penalty: 15 cycles.
If control logic detects that a thread doesn’t make progress, a thread switch will be initiated.
Example for thread switching:

Figure 3.9: Thread switch in Intel’s Montecito vs single thread execution 3.2 Case example 3: Intel Montecito (5) Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20

4. Fine grain multithreading
4.1 Overview (1)
Thread scheduling in multithreaded cores: coarse grain MT, fine grain MT, simultaneous MT (SMT)

4. Fine grain multithreading
4.1 Overview (2)
Fine grain MT:
Round robin selection policy: scalar based, superscalar based, VLIW based.
Priority based selection policy: scalar based, superscalar based, VLIW based.
Example: SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T.

4.2 Case example: SUN UltraSPARC T1 (1)
Aim: commercial server applications, such as web servicing, transaction processing, ERP (Enterprise Resource Planning), DSS (Decision Support Systems).
Characteristics of commercial server applications: large working sets, poor locality of memory references, high cache miss rates, low prediction accuracy for data dependent branches.
Memory latency strongly limits performance, so multithreading is used to hide memory latency.

4.2 Case example: SUN UltraSPARC T1 (3) Figure 4.3: Block diagram of SUN’s UltraSPARC T1 Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29

4.2 Case example: SUN UltraSPARC T1 (2)
Structure: 8 scalar cores, 4-way multithreaded each. All 32 threads share an L2 cache of 3 MB, built up of 4 banks. 4 memory channels with on-chip DDR2 memory controllers.
It runs under Solaris.

4.2 Case example: SUN UltraSPARC T1 (4) Figure 4.3: SUN’s UltraSPARC T1 chip Source: www.princeton.edu/~jdonald/research/hyperthreading/romanescu_niagara.pdf

4.2 Case example: SUN UltraSPARC T1 (6) Figure 4.3: Microarchitecture of the core of SUN’s UltraSPARC T1 Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29

4.2 Case example: SUN UltraSPARC T1 (5)
Processor Elements (Sparc pipes): scalar FX units, 6-stage pipeline; all Processor Elements share a single FP unit.
Each thread of a processor element has its private PC logic, register file, instruction buffer and store buffer.
No thread switch penalty!

4.2 Case example: SUN UltraSPARC T1 (7)
Thread switch: threads are switched on a per cycle basis.
Selection of threads: in the thread select pipeline stage, the thread select multiplexer selects a thread from the set of available threads in each clock cycle and issues the subsequent instruction of this thread into the pipeline for execution.
Threads become unavailable due to: long-latency instructions, such as loads, branches, multiplies, divides; pipeline stalls because of cache misses, traps, resource conflicts.
Thread selection policy: the least recently used policy.
Example 1: all 4 threads are available.
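A minimal sketch of a least recently used thread selector, as an illustration of the policy named above. The class ThreadSelect and its interface are hypothetical and say nothing about how the hardware multiplexer is actually built.

```python
class ThreadSelect:
    """Pick the least recently selected thread among the available ones."""

    def __init__(self, num_threads=4):
        # Front of the list = least recently selected thread.
        self.order = list(range(num_threads))

    def select(self, available):
        """Return the chosen thread id, or None if no thread is available."""
        for tid in self.order:
            if tid in available:
                self.order.remove(tid)
                self.order.append(tid)   # now the most recently selected
                return tid
        return None


if __name__ == "__main__":
    sel = ThreadSelect(4)
    print([sel.select({0, 1, 2, 3}) for _ in range(6)])
```

When all 4 threads stay available, the policy degenerates to plain round robin; it only differs once some threads become unavailable.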

4.2 Case example: SUN UltraSPARC T1 (8) Figure 4.3: Thread switch in the SUN’s UltraSPARC T1 when all threads are available Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29

4.2 Case example: SUN UltraSPARC T1 (9)
Example 2: there are only 2 threads available, with speculative execution of instructions following a load.
(Data referenced by a load instruction arrive in the third cycle after decoding, assuming a cache hit. So, after issuing a load, the thread becomes unavailable for the next two cycles.)
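The two-thread case can be sketched with a small simulation in which a load makes its thread unavailable for two cycles. The mnemonics, the program contents and the function fine_grain_trace are made up for illustration; only the two-cycle gap and the least recently used selection come from the slides.

```python
def fine_grain_trace(programs, load_gap=2):
    """Issue one instruction per cycle, least recently selected thread first.

    programs: dict mapping thread name -> list of mnemonics; "ld" marks a load.
    After issuing a load, a thread is unavailable for load_gap cycles.
    Returns one entry per cycle: a (thread, mnemonic) pair, or None when idle.
    """
    order = list(programs)                  # front = least recently selected
    pc = {t: 0 for t in programs}
    blocked_until = {t: 0 for t in programs}
    trace, cycle = [], 0
    while any(pc[t] < len(programs[t]) for t in programs):
        pick = next((t for t in order
                     if pc[t] < len(programs[t]) and blocked_until[t] <= cycle),
                    None)
        if pick is None:
            trace.append(None)              # no thread available in this cycle
        else:
            op = programs[pick][pc[pick]]
            pc[pick] += 1
            if op == "ld":
                blocked_until[pick] = cycle + 1 + load_gap
            order.remove(pick)
            order.append(pick)
            trace.append((pick, op))
        cycle += 1
    return trace


if __name__ == "__main__":
    for entry in fine_grain_trace({"t0": ["ld", "add"], "t1": ["sub", "ld", "add"]}):
        print(entry)
```

In this run t1 is selected in two consecutive cycles while t0 waits for its load, and t0's add issues in the third cycle after the load.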

4.2 Case example: SUN UltraSPARC T1 (10)
Figure 4.3: Thread switch in the SUN’s UltraSPARC T1 when only 2 threads are available
(The add instruction from thread t0 is speculatively switched into the pipeline assuming a cache hit.)
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29

5. Simultaneous multithreading
5.1 Overview (1)
Thread scheduling in multithreaded cores: coarse grain MT, fine grain MT, simultaneous MT (SMT)

5. Simultaneous multithreading
5.1 Overview (2)
Simultaneous MT:
Scalar based: (no example listed).
Superscalar based: Pentium 4 (2002), 2T; DEC 21464 (2003), 4T; IBM POWER5 (2005), dual-core/2T; Pentium EE 840 (2005), dual-core/2T; Pentium EE 955/965 (2005), dual-core/2T.
VLIW based: (no example listed).

Figure 5.1. Intel Pentium 4 and the visible processor resources duplicated to support hyperthreading technology. Hyperthreading requires duplication of additional miscellaneous pointers and control logic, but these are too small to point out. Source: Koufaty D. and Marr D.T. „Hyperthreading Technology in the Netburst Microarchitecture, IEEE. Micro, Vol. 23, No.2, March-April 2003, pp. 56-65. 5.2 Case example 1: Intel Pentium 4 / HT (2)

5.2 Case example 1: Intel Pentium 4/HT (3) Figure 5.2: SMT pipeline in Intel’s Pentium 4/HT Source: Marr D.T. et al. „Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Vol. 06, Issue 01, Febr 14, 2002, pp. 4-16

5.2 Case example 1: Intel Pentium 4 / HT (1)
Intel designates SMT as Hyperthreading (HT).
Introduced in the Northwood based DP- and MP-server cores in 2/2002 and 3/2002 resp. (called the Prestonia and Foster MP cores), followed by the Northwood core for desktops in 11/2002.
Additions for implementing MT: duplicated architectural state, including the instruction pointer, the general purpose regs., the control regs., the APIC (Advanced Programmable Interrupt Controller) regs., some machine state regs.
Further enhancements to support MT (thread microstate): TC (Trace Cache) entries are tagged, BHB (Branch History Buffer) is duplicated, Global History Table is tagged, RAS (Return Address Stack) is duplicated, rename tables are duplicated, ROB is tagged.
More chip area required for MT: less than 5 %.
Single thread/dual thread modes: to prevent single thread performance degradation, partitioned resources are recombined in single thread mode.

5.2 Case example 2: Alpha 21464 (V8) (2) Figure 5.3: SMT pipeline in the Alpha 21464 (V8) Source: Mukkherjee S., „The Alpha 21364 and 21464 Microprocessors,” http://www.compaq.com

5.2 Case example 2: Alpha 21464 (V8) (1)
8-way superscalar, scheduled for 2003, but canceled in June 2001 in favour of the Itanium line. In 2001 all Alpha intellectual property rights were sold to Intel.
Core enhancements for 4-way multithreading:
Providing replicated (4 x) thread states for: PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files). GPRs/FPRs: 80 (Alpha 21264), 512 (Alpha 21464).
Providing replicated (4 x) thread microstates for: Register Maps.
Additional core area needed for SMT: ~ 6 %.
Source: Preston R. P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading”, Proc. ISSCC, 2002, pp. 334-243

Figure 5.14: POWER4 and POWER5 system structures Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core multithreaded Processor, IEEE. Micro, Vol. 24, No.2, March-April 2004, pp. 40-47. Fabric Controller 5.2 Case example 3: IBM POWER5 (2)

5.2 Case example 3: IBM POWER5 (1) POWER5 enhancements vs the POWER4: on-chip memory control, separate L3/memory attachment, dual threaded.

5.2 Case example 3: IBM POWER5 (3) Figure 5.3: Microarchitecture of IBM’s POWER5 Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003

5.2 Case example 3: IBM POWER5 (4) Figure 5.3: IBM POWER5 Chip Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003

5.2 Case example 3: IBM POWER5 (6) Figure 5.3: SMT pipeline of IBM’s POWER5 Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003

5.2 Case example 3: IBM POWER5 (5)
Core enhancements for multithreading:
Providing duplicated thread states for: PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
GPRs: 80 (POWER4), 120 (POWER5).
FPRs: 72 (POWER4), 120 (POWER5).
Providing duplicated thread microstates for: Return Address Stack, Group Completion (ROB).
Providing increased (duplicated) size for scarce or sensitive resources, such as: Instruction Buffer, Store Queue.
Additional core area needed for SMT: ~ 10 %.

5.2 Case example 3: IBM POWER5 (7)
Unbalanced execution of threads (an enhancement of the single mode/dual mode thread execution model): threads have 8 priority levels (0...7) controlled by HW/SW; the decode rate of each thread is controlled according to the associated priority.
Figure 5.3: Unbalanced execution of threads in IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
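How a priority difference could translate into decode rates can be sketched as follows. The slides do not give the rule, so the power-of-two weighting below is an assumption made for illustration, as is the function name decode_shares; special priority levels (such as a switched-off thread) are not modelled.

```python
def decode_shares(prio_a, prio_b):
    """Split decode cycles between two threads from their priority levels.

    Assumed rule (not taken from the slides): with
    R = 2 ** (|prio_a - prio_b| + 1), the lower-priority thread gets one
    decode cycle out of every R, and the other thread gets the rest.
    Returns (share of thread a, share of thread b).
    """
    for p in (prio_a, prio_b):
        if not 0 <= p <= 7:
            raise ValueError("priority levels range from 0 to 7")
    r = 2 ** (abs(prio_a - prio_b) + 1)
    low, high = 1 / r, (r - 1) / r
    return (high, low) if prio_a >= prio_b else (low, high)


if __name__ == "__main__":
    print(decode_shares(4, 4))   # equal priorities
    print(decode_shares(6, 4))   # thread a two levels higher
```

Under this assumed rule equal priorities give each thread half of the decode cycles, and every additional level of difference halves the share of the lower-priority thread.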

5.2 Case example 3: IBM POWER5 (8)
Development effort:
Concept phase: ~ 10 persons / 4 months.
High level design phase: ~ 50 persons / 6 months.
Implementation phase: ~ 200 persons / 12-18 months.
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
