Simultaneous Multithreading
Pratyusa Manadhata (pratyus@cs), Vyas Sekar (vyass@cs)
Carnegie Mellon, 15740 Fall 03
References
- Susan Eggers, Joel Emer, Henry Levy, Jack Lo, Rebecca Stamm, and Dean Tullsen. Simultaneous Multithreading: A Platform for Next-generation Processors, in IEEE Micro, September/October 1997, pages 12-18.
- Jack Lo, Susan Eggers, Joel Emer, Henry Levy, Rebecca Stamm, and Dean Tullsen. Converting Thread-Level Parallelism Into Instruction-Level Parallelism via Simultaneous Multithreading, in ACM Transactions on Computer Systems, August 1997, pages 322-354.
- Dean Tullsen, Susan Eggers, Joel Emer, Henry Levy, Jack Lo, and Rebecca Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, in Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
Motivation
- Improving the memory subsystem or increasing system integration is not sufficient for a significant performance improvement.
- So increase parallelism in all its available forms:
  - Instruction-level parallelism (ILP)
  - Thread-level parallelism (TLP)
Architectural Alternatives
- Superscalar
- Multithreaded superscalar
- Multiprocessors (SMP)
- Neither a superscalar nor an SMP can capture ILP and TLP in their entirety: both are incapable of adapting to dynamic levels of ILP and TLP.
Simultaneous Multithreading
- TLP from either multithreaded parallel programs or a multiprogramming workload
- ILP from each thread
- Characteristics of SMT processors:
  - from superscalar: issue multiple instructions per cycle
  - from multithreaded: hardware state for multiple threads
[Figure: issue slots for superscalar, multithreaded, and SMT processors]
Comparison
- Superscalar: looks at multiple instructions from the same process; suffers both horizontal and vertical waste.
- Multithreaded: minimizes vertical waste by tolerating long-latency operations.
- SMT: selects instructions from any "ready" thread.
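To make the two kinds of waste concrete, here is a minimal sketch that splits unused issue slots into vertical waste (cycles in which nothing issues) and horizontal waste (unused slots in a cycle that issues at least one instruction). The function name `issue_waste` and the sample trace are invented for illustration and are not taken from the papers.

```python
def issue_waste(slots_used_per_cycle, issue_width):
    """Return (vertical, horizontal) wasted issue slots for a trace.

    slots_used_per_cycle: number of instructions issued in each cycle.
    issue_width: issue slots available per cycle.
    """
    vertical = 0
    horizontal = 0
    for used in slots_used_per_cycle:
        if not 0 <= used <= issue_width:
            raise ValueError("slots used must be between 0 and issue_width")
        if used == 0:
            # Completely idle cycle: every slot counts as vertical waste.
            vertical += issue_width
        else:
            # Partially filled cycle: the leftover slots are horizontal waste.
            horizontal += issue_width - used
    return vertical, horizontal


# Hypothetical 6-cycle trace on a 4-wide machine (values are made up).
trace = [3, 0, 1, 4, 0, 2]
vertical, horizontal = issue_waste(trace, issue_width=4)
print(vertical, horizontal)  # 8 6
```

In this toy trace, coarse multithreading could at best fill the two idle cycles, while SMT can also target the partially filled ones.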
SMT Model
- Minimal extension of a superscalar processor
- Changes in the IF stage and register files only
- No static partitioning of resources
- Most of the hardware is still available to a single thread
SMT Model
- Per thread:
  - State for hardware context (PC, registers)
  - Instruction retirement, trapping, subroutine return
  - Per-thread id in BTB and TLB
- I-cache port
- Large register file:
  - Number of physical registers = 8 * 32 + registers for renaming
  - Longer access time
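The register-file sizing above can be checked with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, and the renaming-register count of 100 is an assumed value chosen for illustration, since the slide does not give one.

```python
def physical_registers(hardware_contexts, architectural_registers, renaming_registers):
    """Physical registers = contexts * architectural registers + renaming registers."""
    return hardware_contexts * architectural_registers + renaming_registers


# 8 hardware contexts with 32 architectural registers each, as on the slide.
# ASSUMPTION: 100 renaming registers, picked only to make the sum concrete.
print(physical_registers(8, 32, 100))  # 356

# For comparison, a single-context machine with the same assumed renaming pool.
print(physical_registers(1, 32, 100))  # 132
```

The larger file is what drives the longer access time noted on the slide.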
[Figure: pipeline of a superscalar processor and of an SMT processor]
Fetch Mechanism (2.8 scheme)
- Notation T.I: fetch from T threads per cycle, up to I instructions from each.
- Select 2 threads not incurring an I-cache miss and read 8 instructions from each.
- Choose as many instructions as possible from the first thread and the rest from the second, up to 8 in total.
- Alternatives: 1.8, 2.4, 4.2
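The 2.8 selection can be sketched roughly as follows. This is a simplified model, not the hardware described in the papers: the function name, the tuple layout, and the sample instruction labels are all assumptions, and real fetch blocks would also end at cache-line boundaries or taken branches.

```python
def fetch_partitioned(candidates, threads_per_cycle=2, per_thread=8, fetch_width=8):
    """Model of a T.I fetch scheme (defaults give 2.8).

    candidates: list of (thread_id, icache_miss, pending_instructions),
                already ordered by fetch priority.
    Returns a list of (thread_id, instruction) of length <= fetch_width.
    """
    # Skip threads that would miss in the I-cache, then take the first T.
    selected = [c for c in candidates if not c[1]][:threads_per_cycle]

    fetched = []
    for thread_id, _miss, pending in selected:
        room = fetch_width - len(fetched)
        if room == 0:
            break
        block = pending[:per_thread]  # read up to I instructions from this thread
        fetched.extend((thread_id, ins) for ins in block[:room])
    return fetched


# Hypothetical state: thread 0 misses in the I-cache, so threads 1 and 2 are used.
candidates = [
    (0, True, ["m0", "m1"]),
    (1, False, ["a", "b", "c", "d", "e"]),
    (2, False, ["p", "q", "r", "s", "t", "u"]),
    (3, False, ["x"]),
]
result = fetch_partitioned(candidates)
print([tid for tid, _ in result])  # [1, 1, 1, 1, 1, 2, 2, 2]
```

Calling the same function with `threads_per_cycle=1` or `per_thread=4` gives rough analogues of the 1.8 and 2.4 alternatives.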
ICOUNT
- Decides which threads to fetch from: those with the fewest instructions in the decode, rename, and queue pipeline stages.
- Gives an even distribution of instructions among threads and prevents starvation.
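A minimal sketch of the ICOUNT priority rule is shown below. The function name, the dictionary layout, and the tie-break on thread id are assumptions made for the example; the per-thread counts are invented.

```python
def icount_select(front_end_counts, icache_miss=frozenset(), threads_per_cycle=2):
    """Pick the threads to fetch from this cycle.

    front_end_counts: {thread_id: instructions currently in the decode,
                       rename, and queue stages}.
    icache_miss: thread ids that would miss in the I-cache this cycle.
    """
    eligible = [t for t in front_end_counts if t not in icache_miss]
    # Fewest in-flight front-end instructions first.
    # ASSUMPTION: ties are broken by the lower thread id.
    ranked = sorted(eligible, key=lambda t: (front_end_counts[t], t))
    return ranked[:threads_per_cycle]


counts = {0: 12, 1: 3, 2: 7, 3: 3}
print(icount_select(counts))                   # [1, 3]
print(icount_select(counts, icache_miss={1}))  # [3, 2]
```

A thread that clogs the queues accumulates a high count and loses fetch priority, which is how the policy spreads instructions evenly and avoids starving the other threads.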
Results/Observations
- Superscalars: IPC of about 1-2
- SMT: IPC significantly higher than the values reported for superscalars
- Longer latency for a single thread? Why?
  - Not a significant performance effect
Results/Observations (continued)
- SMT absorbs additional conflicts: greater ability to hide latency by issuing instructions from multiple threads.
- SMP: MP2 and MP4 are hindered by static resource partitioning.
- SMT dynamically partitions resources among threads.
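The difference between static and dynamic partitioning can be illustrated with a toy issue-slot model. It is a deliberate simplification under stated assumptions: instructions that fail to issue are not carried over to the next cycle, the function names are hypothetical, and the per-cycle ready counts are invented rather than measured.

```python
def issued_static(ready_per_cycle, issue_width):
    """Each thread owns a fixed, equal share of the issue slots."""
    total = 0
    for cycle in ready_per_cycle:
        share = issue_width // len(cycle)
        total += sum(min(ready, share) for ready in cycle)
    return total


def issued_dynamic(ready_per_cycle, issue_width):
    """All threads compete for one shared pool of issue slots."""
    return sum(min(sum(cycle), issue_width) for cycle in ready_per_cycle)


# Ready instructions per thread in each cycle (two threads, made-up values).
ready = [(6, 0), (1, 5), (4, 4), (0, 3)]

print(issued_static(ready, issue_width=8))   # 20
print(issued_dynamic(ready, issue_width=8))  # 23
```

The gap appears in cycles where one thread has more ready instructions than its fixed share while the other leaves slots idle.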
Results/Observations (continued)
- Multithreading can increase cache misses and conflicts
- Higher memory requirements
- More stress on branch prediction hardware
- The impact on program performance is not significant: SMT plus hardware and compiler optimizations can hide the latency
Future Directions
- Each processor in an SMP can use SMT
- Next-generation architectures: SMP on a chip instead of wider superscalars
- Is the performance gain adequate given the additional resource cost?
- Processor cycle design time: cost vs. performance
- Writing optimizing compilers to take advantage of SMT
- OS support for thread scheduling, thread priority, etc.
Q & A
Thank you.