EECC722 - Shaaban, Lec # 2, Fall 2010 (9-6-2010)
Simultaneous Multithreading (SMT): an evolutionary processor architecture originally introduced in 1995.


EECC722 - Shaaban #1 Lec # 2 Fall 2010

Simultaneous Multithreading (SMT)
An evolutionary processor architecture, originally introduced in 1995 by Dean Tullsen at the University of Washington, that aims at reducing resource waste in wide-issue (superscalar) processors. SMT has the potential of greatly enhancing superscalar processor computational capabilities by:
– Exploiting thread-level parallelism (TLP) in a single processor core, simultaneously issuing, executing and retiring instructions from different threads during the same cycle. A single physical SMT processor core acts as a number of logical processors, each executing a single thread.
– Providing multiple hardware contexts, hardware thread scheduling, and context-switching capability.
– Providing effective long-latency hiding (e.g. FP operation latency, branch misprediction, memory latency).

EECC722 - Shaaban #2 Lec # 2 Fall 2010

SMT Issues
SMT CPU performance gain potential.
Modifications to superscalar CPU architecture to support SMT.
SMT performance evaluation vs. fine-grain multithreading, superscalar, and chip multiprocessors.
Hardware techniques to improve SMT performance:
– Optimal level-one cache configuration for SMT.
– SMT thread instruction fetch and issue policies.
– Instruction recycling (reuse) of decoded instructions.
Software techniques:
– Compiler optimizations for SMT.
– Software-directed register deallocation.
– Operating system behavior and optimization.
SMT support for fine-grain synchronization.
SMT as a viable architecture for network processors.
Current SMT implementation: Intel's Hyper-Threading (2-way SMT) microarchitecture and its performance in compute-intensive workloads.
(Ref. papers: SMT-1, SMT-2, SMT-3, SMT-4, SMT-7, SMT-8, SMT-9)

EECC722 - Shaaban #3 Lec # 2 Fall 2010

Evolution of Microprocessors
[Figure: evolution of general-purpose processors (GPPs) from multi-cycle, to pipelined single-issue, to multiple-issue (CPI < 1) superscalar/VLIW/SMT/CMP designs, with clock rates from 1 GHz toward the original (2002) Intel prediction of 15 GHz. Source: John P. Chen, Intel Labs.]
A single-issue processor is a scalar processor. Instructions Per Cycle (IPC) = 1/CPI. Performance equation: T = I x CPI x C.
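As a quick illustration of the performance equation, here is a minimal sketch (the instruction count and clock rate are hypothetical, chosen only to show the effect of CPI on execution time):

```python
# Minimal sketch of the CPU performance equation T = I x CPI x C.
# The instruction count and clock rate below are hypothetical.

def exec_time(instructions, cpi, clock_hz):
    """T = I x CPI x C, where C is the cycle time (1 / clock rate)."""
    return instructions * cpi * (1.0 / clock_hz)

I = 100e6                                              # 100 M instructions (assumed)
scalar      = exec_time(I, cpi=1.0,  clock_hz=1e9)     # ideal pipelined, CPI = 1
superscalar = exec_time(I, cpi=0.25, clock_hz=1e9)     # ideal 4-issue, CPI = 0.25

print(f"scalar:      {scalar * 1e3:.1f} ms")           # 100.0 ms
print(f"superscalar: {superscalar * 1e3:.1f} ms")      # 25.0 ms
```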

EECC722 - Shaaban #4 Lec # 2 Fall 2010

Microprocessor Frequency Trend
[Figure: clock frequency (MHz) and gate delays per clock vs. time for Intel, IBM PowerPC, and DEC processors (Pentium, Pentium Pro, Pentium II, etc.).]
1. Frequency used to double each generation (processor frequency scales by 2X per generation).
2. The number of gate delays per clock reduces by about 25% per generation.
3. This leads to deeper pipelines with more stages (e.g. the Intel Pentium 4E has 30+ pipeline stages).
Result: deeper pipelines, longer stalls, higher CPI (lower effective performance per cycle). (T = I x CPI x C)
Reality check: clock frequency scaling is slowing down! (Did silicon finally hit the wall?) Why? 1) Power leakage. 2) Clock distribution delays.
Possible solutions:
– Exploit thread-level parallelism (TLP) at the chip level (SMT/CMP).
– Utilize/integrate more specialized computing elements other than GPPs.

EECC722 - Shaaban #5 Lec # 2 Fall 2010

Parallelism in Microprocessor VLSI Generations
[Figure: improving microprocessor performance across generations by exploiting more levels of parallelism.]
– Not pipelined: CPI >> 1 (multiple micro-operations per cycle, multi-cycle non-pipelined).
– Single-issue pipelined: CPI = 1.
– Superscalar/VLIW: CPI < 1 — instruction-level parallelism (ILP) within a single thread.
– Chip-level parallel processing — thread-level parallelism (TLP), even more important due to the slowing clock rate increase:
  – Simultaneous multithreading (SMT), e.g. Intel's Hyper-Threading.
  – Chip multiprocessors (CMPs), e.g. IBM POWER4/5, Intel Pentium D and Core 2, AMD Athlon 64 X2 and dual-core Opteron, Sun UltraSPARC T1 (Niagara).

EECC722 - Shaaban #6 Lec # 2 Fall 2010

Microprocessor Architecture Trends
[Figure: general-purpose processor (GPP) architecture tree, from single-threaded designs to chip-level TLP: SMT (single core, e.g. Intel's Hyper-Threading in the P4); CMPs of single- or multi-threaded cores (e.g. IBM POWER4/5, AMD X2/X3/X4, Intel Pentium D, Core 2); and combined SMT/CMPs (e.g. IBM POWER5/6/7, Intel Nehalem (Core i7), Sun UltraSPARC T1 (Niagara)).]

EECC722 - Shaaban #7 Lec # 2 Fall 2010

CPU Architecture Evolution: Single-Threaded, Single-Issue Pipeline
– Traditional 5-stage integer pipeline.
– Increases throughput: ideal CPI = 1.

EECC722 - Shaaban #8 Lec # 2 Fall 2010

CPU Architecture Evolution: Single-Threaded Superscalar Architectures
– Fetch, issue, execute, etc. more than one instruction per cycle (CPI < 1).
– Limited by the instruction-level parallelism (ILP) available in a single thread.

EECC722 - Shaaban #9 Lec # 2 Fall 2010

Superscalar Architecture Limitations: Issue Slot Waste Classification
Empty or wasted issue slots can be classified as either vertical waste or horizontal waste:
– Vertical waste is introduced when the processor issues no instructions in a cycle.
– Horizontal waste occurs when not all issue slots can be filled in a cycle.
Example: a 4-issue superscalar has ideal IPC = 4 (ideal CPI = 0.25); Instructions Per Cycle (IPC) = 1/CPI. The same classification also applies to VLIW.
Result of issue slot waste: actual performance << peak performance.
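Stated operationally, the classification is made cycle by cycle. A minimal sketch, assuming a hypothetical per-cycle issue trace on a 4-issue machine:

```python
# Classify empty issue slots as vertical or horizontal waste.
# issued_per_cycle is a hypothetical trace: instructions issued each cycle
# on a 4-issue superscalar (ideal IPC = 4).

ISSUE_WIDTH = 4
issued_per_cycle = [4, 1, 0, 2, 0, 3]        # assumed example trace

vertical = horizontal = 0
for issued in issued_per_cycle:
    if issued == 0:
        vertical += ISSUE_WIDTH              # no instructions at all this cycle
    else:
        horizontal += ISSUE_WIDTH - issued   # cycle only partially filled

total_slots = ISSUE_WIDTH * len(issued_per_cycle)
print(f"IPC = {sum(issued_per_cycle) / len(issued_per_cycle):.2f} (ideal {ISSUE_WIDTH})")
print(f"vertical waste:   {vertical} slots")
print(f"horizontal waste: {horizontal} slots")
print(f"wasted: {100 * (vertical + horizontal) / total_slots:.0f}% of issue slots")
```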

EECC722 - Shaaban #10 Lec # 2 Fall 2010

Sources of Unused Issue Cycles in an 8-issue Superscalar Processor
"Processor busy" represents the utilized issue slots; all others represent wasted issue slots. 61% of the wasted cycles are vertical waste; the remainder are horizontal waste. Workload: SPEC92 benchmark suite.
Ideal IPC = 8 (CPI = 1/8). The measured average IPC is about 1.5 instructions/cycle: real IPC << ideal IPC (1.5 << 8), i.e. about 81% of issue slots are wasted and only 18.75% of the ideal IPC is achieved.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995. [SMT-1]

EECC722 - Shaaban #11 Lec # 2 Fall 2010

Superscalar Architecture Limitations
[Table: all possible causes of wasted issue slots, and the latency-hiding or latency-reducing techniques that can reduce the number of cycles wasted by each cause.]
Main issue: one thread provides limited ILP and cannot fill the issue slots.
Solution: exploit thread-level parallelism (TLP) within a single microprocessor chip. How?
– Simultaneous multithreaded (SMT) processor: the processor issues and executes instructions from a number of threads, creating a number of logical processors within a single physical processor. E.g. with Intel's Hyper-Threading (HT), each physical processor executes instructions from two threads.
AND/OR
– Chip multiprocessors (CMPs): integrate two or more complete processor cores on the same chip (die); each core runs a different thread (or program). Limited ILP is still a problem in each core (solution: combine this approach with SMT).
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995. [SMT-1]

EECC722 - Shaaban #12 Lec # 2 Fall 2010

Advanced CPU Architectures: VLIW — Intel/HP IA-64, Explicitly Parallel Instruction Computing (EPIC)
Strengths:
– Allows for a high level of instruction-level parallelism (ILP).
– Takes much of the dependency analysis out of hardware and places the focus on smart compilers.
Weaknesses:
– Limited by the instruction-level parallelism (ILP) in a single thread.
– Keeping functional units (FUs) busy is difficult (control hazards).
– Static FU scheduling limits performance gains.
– The resulting overall performance depends heavily on compiler performance.

EECC722 - Shaaban #13 Lec # 2 Fall 2010

Advanced CPU Architectures: Single-Chip Multiprocessors (CMPs), AKA multi-core processors
Examples: IBM POWER4/5; Intel Pentium D, Core Duo, Core 2 (Conroe), Core i7; AMD Athlon 64 X2/X3/X4, dual/quad-core Opteron; Sun UltraSPARC T1 (Niagara).
Strengths:
– Create a single processor block and duplicate it.
– Exploits thread-level parallelism at the chip level.
– Takes much of the dependency analysis out of hardware and places the focus on smart compilers.
Weaknesses:
– Performance within each processor is still limited by individual thread performance (ILP).
– High power requirements using current VLSI processes: nearly entire processor cores are replicated on the chip, and cores may have to run at lower clock rates to reduce heat/power consumption.

EECC722 - Shaaban #14 Lec # 2 Fall 2010

Advanced CPU Architectures: Single-Chip Multiprocessor (CMP)
[Figure: block diagram of a 4-way CMP, or more generally a CMP with n cores.]

EECC722 - Shaaban #15 Lec # 2 Fall 2010

Current Dual-Core Chip-Multiprocessor (CMP) Architectures
1. Single die, shared L2 cache: cores communicate through the shared cache (lowest communication latency). Examples: IBM POWER4/5, Intel Pentium Core Duo (Yonah) and Conroe, Sun UltraSPARC T1 (Niagara), quad-core AMD K10 (shared L3 cache).
2. Single die, private caches, shared system interface: cores communicate using on-chip interconnects (shared system interface, e.g. an on-chip crossbar/switch). Examples: AMD dual-core Opteron and Athlon 64 X2, Intel Itanium 2 (Montecito).
3. Two dice in a shared package, private caches, private system interfaces: cores communicate over the external Front Side Bus (FSB) (highest communication latency). Example: Intel Pentium D.
Source: Real World Technologies.

EECC722 - Shaaban #16 Lec # 2 Fall 2010

Advanced CPU Architectures: Fine-Grained or Traditional Multithreaded Processors
– Multiple hardware contexts (PC, SP, and registers).
– Only one context or thread issues instructions each cycle.
– Performance is limited by the instruction-level parallelism (ILP) within each individual thread:
  – Can reduce some of the vertical issue-slot waste.
  – No reduction in horizontal issue-slot waste.
– Example architecture: The Tera Computer System.

EECC722 - Shaaban #17 Lec # 2 Fall 2010

Fine-Grain or Traditional Multithreaded Processors: The Tera (Cray) Computer System
The Tera computer system is a shared-memory multiprocessor that can accommodate up to 256 processors. Each Tera processor is fine-grain multithreaded:
– Each processor can issue one 3-operation Long Instruction Word (LIW) every 3 ns cycle (333 MHz), from one thread, selected from among as many as 128 distinct instruction streams (hardware threads), thereby hiding up to 128 cycles (384 ns) of memory latency.
– In addition, each stream can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor.
– A stream implements a load/store architecture with three addressing modes and 31 general-purpose 64-bit registers.
– The instructions are 64 bits wide and can contain three operations: a memory reference operation (M-unit operation, or M-op for short), an arithmetic or logical operation (A-op), and a branch or simple arithmetic or logical operation (C-op).
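The latency-hiding arithmetic above can be checked directly; a minimal sketch using only the figures quoted on this slide (the last line is a derived product, not a number from the slide):

```python
# Latency tolerance of one fine-grain multithreaded Tera processor,
# using the numbers quoted above.

cycle_ns   = 3      # one 3-operation LIW issued every 3 ns (333 MHz)
streams    = 128    # hardware instruction streams per processor
refs_per_s = 8      # outstanding memory references allowed per stream

# With one stream issuing per cycle, a stalled stream is not revisited for
# up to `streams` cycles, so that much memory latency is hidden:
print(f"latency hidden: {streams} cycles = {streams * cycle_ns} ns")   # 384 ns

# Each stream may also overlap up to 8 memory references of its own
# (derived upper bound, assuming all streams are active):
print(f"max outstanding refs per processor: {streams * refs_per_s}")
```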

EECC722 - Shaaban #18 Lec # 2 Fall 2010

SMT: Simultaneous Multithreading
– Multiple hardware contexts (or threads) run at the same time (hardware context: registers, PC, SP, etc.). A single physical SMT processor core acts as (and reports to the operating system as) a number of logical processors, each executing a single thread.
– Reduces both horizontal and vertical waste by having multiple threads keep the functional units busy during every cycle.
– Builds on top of current, time-proven advancements in CPU design: superscalar execution, dynamic scheduling, hardware speculation, dynamic hardware branch prediction, multiple levels of cache, hardware prefetching, etc.
– Enabling technology: VLSI logic density on the order of hundreds of millions of transistors per chip.
– The potential performance gain is much greater than the increase in chip area and power consumption needed to support SMT: a 2-way SMT processor requires a 10-15% increase in area, vs. ~100% for a dual-core CMP. This improves performance per chip area per watt (computational efficiency) vs. single-threaded superscalar cores.

EECC722 - Shaaban #19 Lec # 2 Fall 2010

SMT (continued)
– With multiple threads running, penalties from long-latency operations, cache misses, and branch mispredictions are hidden: both horizontal and vertical waste are reduced, improving the Instructions Per Cycle (IPC) rate. SMT is thus an effective long-latency-hiding technique.
– Functional units are shared among all contexts (hardware threads) during every cycle; this requires more complicated register read and writeback stages.
– More threads issuing to the functional units results in higher resource utilization.
– CPU resources may have to be resized to accommodate the additional demands of the multiple threads running (e.g. caches, TLBs, branch prediction tables, rename registers).

EECC722 - Shaaban #20 Lec # 2 Fall 2010

SMT: Simultaneous Multithreading
[Figure: one n-way SMT core built from a modified out-of-order superscalar core with n hardware contexts.]

EECC722 - Shaaban #21 Lec # 2 Fall 2010

The Power of SMT
[Figure: issue-slot diagrams over time (processor cycles) for a superscalar processor, a traditional (fine-grain) multithreaded processor, and simultaneous multithreading. Rows of squares represent instruction issue slots; a box with number x is an instruction issued from thread x; an empty box is a wasted slot.]
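To render the same idea in text, here is a minimal sketch (the per-thread issue demands are hypothetical) that prints the issue-slot grid for a 4-issue machine under each of the three models:

```python
# Text rendering of the issue-slot diagrams above: each row is one cycle,
# each column one issue slot; a digit is the thread that issued there,
# '.' is a wasted slot. The per-thread demand values are hypothetical.

WIDTH, CYCLES = 4, 5
# demand[t][c] = instructions thread t has ready to issue in cycle c (assumed)
demand = [[2, 0, 3, 1, 4],
          [1, 2, 0, 3, 2],
          [0, 4, 1, 2, 1]]

def pad(row):
    return (row + ["."] * WIDTH)[:WIDTH]

def superscalar():                 # single thread (thread 1) only
    return [pad(["1"] * demand[0][c]) for c in range(CYCLES)]

def fine_grain():                  # one thread issues per cycle, round robin
    return [pad([str(c % 3 + 1)] * demand[c % 3][c]) for c in range(CYCLES)]

def smt():                         # all threads compete for slots every cycle
    return [pad([str(t + 1) for t in range(3) for _ in range(demand[t][c])])
            for c in range(CYCLES)]

for name, model in [("superscalar", superscalar),
                    ("fine-grain MT", fine_grain), ("SMT", smt)]:
    print(f"--- {name} ---")
    print("\n".join("".join(r) for r in model()))
```

The SMT grid fills slots that the other two models leave empty, which is exactly the figure's point.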

EECC722 - Shaaban #22 Lec # 2 Fall 2010

SMT Performance Example

Inst  Code            Description      Functional unit
A     LUI R5,100      R5 = 100         Int ALU
B     FMUL F1,F2,F3   F1 = F2 x F3     FP ALU
C     ADD R4,R4,8     R4 = R4 + 8      Int ALU
D     MUL R3,R4,R5    R3 = R4 x R5     Int mul/div
E     LW R6,R4        R6 = (R4)        Memory port
F     ADD R1,R2,R3    R1 = R2 + R3     Int ALU
G     NOT R7,R7       R7 = !R7         Int ALU
H     FADD F4,F1,F2   F4 = F1 + F2     FP ALU
I     XOR R8,R1,R7    R8 = R1 XOR R7   Int ALU
J     SUBI R2,R1,4    R2 = R1 - 4      Int ALU
K     SW ADDR,R2      (ADDR) = R2      Memory port

Functional units (assume all are fully pipelined):
– 4 integer ALUs (1-cycle latency)
– 1 integer multiplier/divider (3-cycle latency)
– 3 memory ports (2-cycle latency, assume cache hit)
– 2 FP ALUs (5-cycle latency)

EECC722 - Shaaban #23 Lec # 2 Fall 2010

SMT Performance Example (continued)
Here the machine is a 4-issue, 2-thread SMT, and SMT needs 2 additional cycles to complete the second program (the 2nd thread).
Throughput:
– Superscalar: 11 instructions / 7 cycles = 1.57 IPC
– SMT: 22 instructions / 9 cycles = 2.44 IPC
– SMT is 2.44/1.57 = 1.55 times faster than the superscalar for this example (ideal speedup for 2 threads = 2).
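A quick arithmetic check of the numbers above (counts taken directly from the example):

```python
# Throughput and speedup for the example above.

insts_one_thread, cycles_ss   = 11, 7   # superscalar: program 1 alone
insts_two_threads, cycles_smt = 22, 9   # SMT: programs 1 and 2 together

ipc_ss  = insts_one_thread  / cycles_ss     # 1.57
ipc_smt = insts_two_threads / cycles_smt    # 2.44

print(f"superscalar IPC = {ipc_ss:.2f}")
print(f"SMT IPC         = {ipc_smt:.2f}")
print(f"speedup         = {ipc_smt / ipc_ss:.2f}  (ideal for 2 threads = 2)")
```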

EECC722 - Shaaban #24 Lec # 2 Fall 2010

Modifications to Superscalar CPUs to Support SMT
Necessary modifications:
– Multiple program counters and a mechanism by which the fetch unit selects one each cycle (the thread instruction fetch/issue policy).
– A separate return stack for each thread, for predicting subroutine return destinations.
– Per-thread instruction issue/retirement, instruction queue flush, and trap mechanisms (i.e. per-thread state).
– A thread ID with each branch target buffer entry, to avoid predicting phantom branches.
Modifications to improve SMT performance:
– A larger rename register file, to support the logical registers of all threads plus additional registers for register renaming (this may require additional pipeline stages).
– Higher available main memory fetch bandwidth may be required.
– A larger data TLB with more entries, to compensate for the increased number of virtual-to-physical address translations.
– Improved caches, to offset the cache performance degradation due to cache sharing among the threads and the resulting reduced locality (e.g. private per-thread vs. shared L1 cache).
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996. [SMT-2]
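To summarize the list above, here is a minimal sketch (all names and sizes are illustrative, not taken from the paper) of which state an SMT core replicates per hardware thread and which it shares:

```python
# Illustrative split of SMT core state into per-thread (replicated) and
# shared structures, following the modification list above. All field
# names and sizes here are hypothetical.

from dataclasses import dataclass, field

@dataclass
class HardwareContext:                      # replicated per hardware thread
    pc: int = 0                             # private program counter
    return_stack: list = field(default_factory=list)             # private return stack
    logical_regs: list = field(default_factory=lambda: [0] * 64) # 32 int + 32 FP
    retire_state: dict = field(default_factory=dict)             # per-thread retire/trap

@dataclass
class SharedCore:                           # shared by all threads each cycle
    rename_regs: int = 256                  # enlarged rename register file
    dtlb_entries: int = 128                 # larger data TLB
    # BTB entries are tagged with a thread id to avoid phantom branches:
    btb: dict = field(default_factory=dict) # (thread_id, pc) -> predicted target
    contexts: list = field(default_factory=list)

core = SharedCore(contexts=[HardwareContext() for _ in range(2)])  # 2-way SMT
```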

EECC722 - Shaaban #25 Lec # 2 Fall 2010

SMT Implementations
– Intel's implementation of Hyper-Threading (HT) Technology (2-thread SMT):
  – Originally implemented in its NetBurst microarchitecture (P4 processor family).
  – Current Hyper-Threading implementation: Intel's Nehalem (Core i7, introduced 4th quarter 2008): 2, 4, or 8 cores per chip, each 2-thread SMT (4-16 threads per chip).
– IBM POWER5/6: dual cores, each 2-thread SMT.
– The Alpha EV8 (4-thread SMT), originally scheduled for production but later canceled.
– A number of special-purpose processors targeted towards network processor (NP) applications.
– Sun UltraSPARC T1 (Niagara): eight processor cores, each executing from 4 hardware threads (32 threads total). This is actually not SMT but fine-grain multithreading (each core issues one instruction from one thread per cycle).
– Current technology has the potential for 4-8 simultaneous threads per core (based on transistor count and design complexity).

EECC722 - Shaaban #26 Lec # 2 Fall 2010

A Base SMT Hardware Architecture
[Figure: base SMT hardware architecture — an in-order front end (fetch/issue) feeding a modified superscalar out-of-order core (speculative Tomasulo).]
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996. [SMT-2]

EECC722 - Shaaban #27 Lec # 2 Fall 2010

Example SMT vs. Superscalar Pipeline
[Figure: the pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor (based on the Alpha pipeline), along with some implications of those pipelines. Two extra pipeline stages are added for register read/write to account for the size increase of the register file.]
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996. [SMT-2]

EECC722 - Shaaban #28 Lec # 2 Fall 2010

Intel Hyper-Threaded (2-way SMT) P4 Processor Pipeline
[Figure: pipeline of the Hyper-Threaded Pentium 4.]
Source: Intel Technology Journal, Volume 6, Number 1, February 2002. [SMT-8]

EECC722 - Shaaban #29 Lec # 2 Fall 2010

Intel P4 Out-of-order Execution Engine Detailed Pipeline
[Figure: detailed pipeline of the Hyper-Threaded (2-way SMT) P4 out-of-order execution engine.]
Source: Intel Technology Journal, Volume 6, Number 1, February 2002. [SMT-8]

EECC722 - Shaaban #30 Lec # 2 Fall 2010

SMT Performance Comparison
Instruction throughput (IPC) from simulations by Eggers et al. at the University of Washington, using both multiprogramming and parallel workloads:
[Table: IPC vs. number of threads for (a) a multiprogramming workload — superscalar, traditional (fine-grained) multithreading, and SMT — and (b) a parallel workload — superscalar, chip multiprocessors MP2 and MP4, traditional multithreading, and SMT.]
Multiprogramming workload = multiple single-threaded programs (multi-tasking); parallel workload = a single multi-threaded program (MP = chip multiprocessor).

EECC722 - Shaaban #31 Lec # 2 Fall 2010

Possible Machine Models for an 8-way Multithreaded Processor
The following machine models for a multithreaded CPU that can issue 8 instructions per cycle differ in how threads use issue slots and functional units:
– Fine-Grain Multithreading: only one thread issues instructions each cycle, but it can use the entire issue width of the processor. This hides all sources of vertical waste, but does not hide horizontal waste.
– SM:Full Simultaneous Issue (i.e. SM:Eight Issue): a completely flexible simultaneous multithreaded superscalar; all eight threads compete for each of the 8 issue slots each cycle. This is the least realistic model in terms of hardware complexity, but it provides insight into the potential of simultaneous multithreading. The following models each represent restrictions to this scheme that decrease hardware complexity.
– SM:Single Issue, SM:Dual Issue, and SM:Four Issue: these three models limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle. For example, in an SM:Dual Issue processor each thread can issue a maximum of 2 instructions per cycle; therefore, a minimum of 4 threads is required to fill the 8 issue slots in one cycle.
– SM:Limited Connection: each hardware context is directly connected to exactly one of each type of functional unit. For example, if the hardware supports eight threads and there are four integer units, each integer unit could receive instructions from exactly two threads. The partitioning of functional units among threads is thus less dynamic than in the other models, but each functional unit is still shared (the critical factor in achieving high utilization).
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995. [SMT-1]
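The per-thread issue caps imply a lower bound on the thread count needed to fill the machine. A minimal sketch over the models above (SM:Limited Connection constrains FU routing rather than issue width, so it is left out):

```python
import math

# Minimum number of threads needed to fill all 8 issue slots under each
# model's per-thread issue cap.

ISSUE_WIDTH = 8
models = {"SM:Single Issue": 1, "SM:Dual Issue": 2,
          "SM:Four Issue": 4, "SM:Full Simultaneous Issue": 8}

for name, cap in models.items():
    need = math.ceil(ISSUE_WIDTH / cap)
    print(f"{name:>27}: >= {need} threads to fill {ISSUE_WIDTH} slots")
```

For SM:Dual Issue this prints 4, matching the example in the slide.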

EECC722 - Shaaban #32 Lec # 2 Fall 2010

Comparison of Multithreaded CPU Models' Complexity
[Table: comparison of key hardware complexity features of the various models (H = high complexity).]
The comparison takes into account:
– the number of ports needed for each register file,
– the dependence checking needed for a single thread to issue multiple instructions,
– the amount of forwarding logic,
– and the difficulty of scheduling issued instructions onto functional units.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995. [SMT-1]

EECC722 - Shaaban #33 Lec # 2 Fall 2010

Simultaneous vs. Fine-Grain Multithreading Performance
[Figure: instruction throughput (IPC) as a function of the number of threads; workload: SPEC92. Panels (a)-(c) show the throughput by thread priority for particular models, and (d) shows the total throughput for all threads for each of the six machine models. The lowest segment of each bar is the contribution of the highest-priority thread to the total throughput.]
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995. [SMT-1]

EECC722 - Shaaban #34 Lec # 2 Fall 2010

Simultaneous Multithreading (SM) vs. Single-Chip Multiprocessing (MP)
[Figure: results for the multiprocessor (MP) vs. simultaneous multithreading (SM) comparisons. The multiprocessor always has one functional unit of each type per processor; in most cases the SM processor has the same total number of each FU type as the MP.]
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995. [SMT-1]

EECC722 - Shaaban #35 Lec # 2 Fall 2010

Impact of Level-1 Cache Sharing on SMT Performance
[Figure: results for the simulated cache configurations, shown relative to the throughput (instructions per cycle) of the 64s.64p configuration.]
Notation: the caches are specified as [total I-cache size in KB][private or shared].[D-cache size][private or shared]. For instance, 64p.64s has eight private 8 KB instruction caches and a shared 64 KB data cache, while 64s.64p has a shared 64 KB instruction cache and private data caches (8 KB per thread).
The best overall performance of the configurations considered is achieved by 64s.64s (shared 64 KB instruction cache, shared 64 KB data cache).
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995. [SMT-1]

EECC722 - Shaaban #36 Lec # 2 Fall 2010

The Impact of Increased Multithreading on Some Low-Level Metrics for the Base SMT Architecture
[Table: low-level metrics vs. number of threads.]
Supporting more threads places more demand on hardware resources; e.g. here the data and instruction cache miss rates increase substantially, so those resources need to be resized.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996. [SMT-2]

EECC722 - Shaaban #37 Lec # 2 Fall 2010

Possible SMT Thread Instruction Fetch Scheduling Policies
– Round Robin (RR): fetch instructions from Thread 1, then Thread 2, then Thread 3, etc. In the RR.x.y notation, x threads each fetch up to y instructions per cycle (e.g. RR.1.8: each cycle one thread fetches up to eight instructions; RR.2.4: each cycle two threads fetch up to four instructions each).
– BRCOUNT: give highest priority to those threads that are least likely to be on a wrong path, by counting the branch instructions that are in the decode stage, the rename stage, and the instruction queues, favoring those threads with the fewest unresolved branches.
– MISSCOUNT: give priority to those threads that have the fewest outstanding data cache misses.
– ICOUNT: give highest priority to the thread with the lowest number of instructions in the static portion of the pipeline (decode, rename, and the instruction queues).
– IQPOSN (Instruction Queue Position): give lowest priority to those threads with instructions closest to the head of either the integer or floating-point instruction queues (the oldest instruction is at the head of the queue).
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996. [SMT-2]
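As a concrete illustration of ICOUNT (which the following slides show performs best), a minimal sketch of an ICOUNT.2.8-style selection; the per-thread instruction counts are hypothetical:

```python
# ICOUNT.2.8-style fetch selection: each cycle, pick the 2 threads with the
# fewest instructions in the static (pre-issue) pipeline stages -- decode,
# rename, and the instruction queues -- and let each fetch up to 8
# instructions. The per-thread counts below are hypothetical.

def icount_select(static_counts, threads_per_cycle=2):
    """Return the ids of the threads with the lowest pre-issue counts."""
    order = sorted(range(len(static_counts)), key=lambda t: static_counts[t])
    return order[:threads_per_cycle]

static_counts = [12, 3, 7, 9, 0, 15, 6, 4]   # assumed, one entry per thread
chosen = icount_select(static_counts)
print(f"fetch this cycle (up to 8 insts each): threads {chosen}")  # [4, 1]
```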

EECC722 - Shaaban #38 Lec # 2 Fall 2010

Instruction Throughput for Round-Robin Instruction Fetch Scheduling
[Figure: instruction throughput for the round-robin fetch policies; workload: SPEC92.]
The best overall instruction throughput among the round-robin variants is achieved by RR.2.8 (each cycle, two threads each fetch a block of up to 8 instructions).
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996. [SMT-2]

EECC722 - Shaaban #39 Lec # 2 Fall 2010

Instruction Throughput and Thread Fetch Policy
[Figure: instruction throughput for the different fetch policies; workload: SPEC92.]
All the other fetch heuristics provide a speedup over round robin. ICOUNT (highest priority to the thread with the lowest number of instructions in the static portion of the pipeline: decode, rename, and the instruction queues) provides the most improvement: ICOUNT.2.8 achieves 5.3 instructions/cycle vs. 2.5 for the unmodified superscalar.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996. [SMT-2]

EECC722 - Shaaban #40 Lec # 2 Fall 2010

Low-Level Metrics for Round Robin 2.8 and ICOUNT 2.8
[Table: low-level metrics for the RR.2.8 and ICOUNT.2.8 fetch policies.]
ICOUNT improves on the performance of round robin by 23%, by reducing instruction queue (IQ) clog: it selects a better mix of instructions to queue. [SMT-2]

EECC722 - Shaaban #41 Lec # 2 Fall 2010

Possible SMT Instruction Issue Policies
– OLDEST_FIRST: issue the oldest instructions (those deepest into the instruction queue); this is the default.
– OPT_LAST and SPEC_LAST: issue optimistic and speculative instructions only after all others have been issued.
– BRANCH_FIRST: issue branches as early as possible in order to identify mispredicted branches quickly.
The ICOUNT.2.8 fetch policy is used for all of the issue policies above. As the results show, instruction issue bandwidth is not a bottleneck in SMT.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996. [SMT-2]
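A minimal sketch of the default OLDEST_FIRST policy (the queue entries and their fields are hypothetical), which the results above show is adequate because issue bandwidth is not the bottleneck:

```python
# OLDEST_FIRST issue selection: issue the ready instructions that are
# deepest in the instruction queue (i.e., oldest). Each entry carries a
# hypothetical age (cycles in the queue) and a ready flag.

def oldest_first(queue, issue_width=8):
    ready = [e for e in queue if e["ready"]]
    ready.sort(key=lambda e: e["age"], reverse=True)  # oldest first
    return ready[:issue_width]

iq = [{"id": i, "age": a, "ready": r}
      for i, (a, r) in enumerate([(9, True), (7, False), (5, True),
                                  (3, True), (1, True)])]
print([e["id"] for e in oldest_first(iq, issue_width=3)])  # [0, 2, 3]
```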

EECC722 - Shaaban #42 Lec # 2 Fall 2010

SMT: Simultaneous Multithreading
Strengths:
– Overcomes the limitations imposed by low single-thread instruction-level parallelism.
– Resource-efficient support of chip-level TLP.
– Multiple threads running will hide individual control hazards (i.e. branch mispredictions) and other long latencies (e.g. main memory access latency on a cache miss).
Weaknesses:
– Additional stress placed on the memory hierarchy.
– Control unit complexity.
– Sizing of resources (caches, branch prediction tables, TLBs, etc.).
– Accessing registers (32 integer + 32 FP for each hardware context): some designs devote two clock cycles each for register reads and register writes, resulting in a deeper pipeline.