Simultaneous Multithreading in Superscalar Processors
By Connor Sample
What is Simultaneous Multithreading (SMT)?
- SMT is the ability of a processor to execute instructions from multiple distinct threads at the same time.
- Goal: increased processor throughput and better utilization of processor resources.
- It addresses two major bottlenecks in modern processors:
  - Long latencies
  - Limited per-thread parallelism
- It is used on superscalar processors.
Superscalar processors
- A superscalar processor tries to issue multiple instructions from a single thread in each cycle.
- When that thread stalls or lacks enough independent instructions, issue slots go unused and the cycle's resources are wasted.
- Superscalar processors are typically pipelined, though the two are separate performance enhancement techniques:
  - Superscalar: multiple instructions per cycle, issued to multiple execution units
  - Pipelining: multiple instructions in flight at once, each in a different stage of a single execution unit
- Figure: Processor board from a Cray T3E supercomputer
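As a rough illustration of why pipelining and superscalar issue are separate techniques, the sketch below counts cycles for an idealized machine. It assumes no stalls, hazards, or dependencies, and the function names and parameters (`stages`, `width`) are hypothetical, chosen only for this example.

```python
import math


def cycles_unpipelined(n, stages):
    """Each instruction passes through every stage before the next one starts."""
    return n * stages


def cycles_pipelined(n, stages):
    """One instruction enters the pipeline per cycle; the first takes `stages` cycles."""
    return stages + n - 1


def cycles_superscalar(n, stages, width):
    """A pipelined machine that starts up to `width` instructions per cycle."""
    return stages + math.ceil(n / width) - 1


if __name__ == "__main__":
    n, stages = 8, 5
    print("unpipelined:", cycles_unpipelined(n, stages))
    print("pipelined:", cycles_pipelined(n, stages))
    print("2-wide superscalar:", cycles_superscalar(n, stages, 2))
```

Under these idealized assumptions, 8 instructions on a 5-stage machine take 40, 12, and 8 cycles respectively. Real programs fall short of the superscalar figure whenever a single thread cannot supply enough independent instructions.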
Superscalar processors + SMT
Figure panels: Regular superscalar processor, Multithreaded architecture, SMT architecture
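The difference between the three architectures named above can be sketched as a toy issue-slot count. This is a simplified model, not a processor simulator: the `READY` table of per-cycle ready instructions is made up for illustration, unissued instructions are not carried over to the next cycle, and the multithreaded case is assumed to switch threads round-robin every cycle.

```python
ISSUE_WIDTH = 4

# Hypothetical number of ready, independent instructions per thread in each cycle.
READY = [
    [2, 0, 1, 3, 0, 2],  # thread 0
    [1, 2, 0, 1, 3, 0],  # thread 1
    [0, 1, 2, 0, 1, 2],  # thread 2
]


def superscalar(ready, width):
    """Only thread 0 runs; each cycle issues what that thread has ready."""
    return sum(min(width, n) for n in ready[0])


def multithreaded(ready, width):
    """One thread per cycle, chosen round-robin; slots are never shared."""
    threads = len(ready)
    cycles = len(ready[0])
    return sum(min(width, ready[c % threads][c]) for c in range(cycles))


def smt(ready, width):
    """All threads compete for the same issue slots in every cycle."""
    cycles = len(ready[0])
    return sum(min(width, sum(t[c] for t in ready)) for c in range(cycles))


if __name__ == "__main__":
    total = ISSUE_WIDTH * len(READY[0])
    for name, fn in [("superscalar", superscalar),
                     ("multithreaded", multithreaded),
                     ("SMT", smt)]:
        used = fn(READY, ISSUE_WIDTH)
        print(f"{name}: {used}/{total} issue slots used")
```

With this particular table the three models fill 8, 14, and 21 of the 24 available slots. The exact numbers depend entirely on the invented workload; the point is only that SMT can fill slots within a cycle that the other two leave empty.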
Architecture
SMT architecture draws directly from traditional superscalar processors:
- The fetch unit fetches as many instructions as it is allowed per cycle.
- Instructions are decoded and passed to a register renaming unit, which maps each architectural (logical) register to a physical register taken from a pool of unused registers, eliminating false dependencies.
- Instructions are placed in one of two queues to await issue.
- Instructions are issued as soon as the resources they need become available.
- Once an instruction completes, it is retired and its associated register is freed for use by another instruction.
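The register renaming step can be sketched in a few lines. This is a minimal model under stated assumptions: the class and method names are hypothetical, and it assumes the register freed at retirement is the physical register that previously held the destination, which is one common scheme rather than something the slides specify.

```python
from collections import deque


class RenameUnit:
    """Toy register renamer: architectural names map to physical register numbers."""

    def __init__(self, arch_regs, num_physical):
        if num_physical < len(arch_regs):
            raise ValueError("need at least one physical register per architectural register")
        self.free = deque(range(num_physical))
        # Give every architectural register an initial physical register.
        self.mapping = {name: self.free.popleft() for name in arch_regs}

    def rename(self, dest, sources):
        """Return (new_dest_phys, source_phys_list, previous_dest_phys)."""
        # Look up sources before remapping the destination, so an
        # instruction such as r1 = r1 + r2 reads the old r1.
        src_phys = [self.mapping[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical registers; the front end would stall")
        previous = self.mapping[dest]
        self.mapping[dest] = self.free.popleft()
        return self.mapping[dest], src_phys, previous

    def retire(self, previous_dest_phys):
        """Free the physical register that the retiring instruction's destination replaced."""
        self.free.append(previous_dest_phys)


if __name__ == "__main__":
    unit = RenameUnit(["r1", "r2", "r3"], num_physical=6)
    first = unit.rename("r1", ["r2", "r3"])   # r1 = r2 + r3
    second = unit.rename("r2", ["r3", "r3"])  # r2 = r3 + r3 (writes a register the first reads)
    third = unit.rename("r1", ["r2", "r3"])   # r1 = r2 + r3 (writes r1 again)
    print(first, second, third)
    unit.retire(first[2])
```

Because each write to `r1` or `r2` receives a fresh physical register, the second and third instructions no longer conflict with the first over register names; only the true dependency (the third instruction reading the second's result) remains.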
Architectural Additions
Some additions to this architecture are needed to enable SMT:
- Extra program counters, one per thread
- A mechanism for the fetch unit to select which program counter to fetch from
- Per-thread instruction retirement, queue flushing, and trap handling
- A larger register file to accommodate the increased demand
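The first two additions, per-thread program counters and a way for the fetch unit to choose among them, can be sketched as follows. The slides do not say which selection policy is used, so round-robin is an assumption here, and `ThreadContext`, `FetchUnit`, and `end_pc` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThreadContext:
    thread_id: int
    pc: int       # this thread's own program counter
    end_pc: int   # hypothetical: first address past the thread's last instruction

    @property
    def done(self):
        return self.pc >= self.end_pc


class FetchUnit:
    """Fetches from one thread per cycle, choosing the thread round-robin."""

    def __init__(self, contexts, fetch_width):
        self.contexts = contexts
        self.fetch_width = fetch_width
        self.next_index = 0

    def select_thread(self):
        """Return the next unfinished thread context, or None if all are done."""
        n = len(self.contexts)
        for offset in range(n):
            index = (self.next_index + offset) % n
            if not self.contexts[index].done:
                self.next_index = (index + 1) % n
                return self.contexts[index]
        return None

    def fetch(self):
        """Fetch up to fetch_width instructions; return (thread_id, address) pairs."""
        ctx = self.select_thread()
        if ctx is None:
            return []
        count = min(self.fetch_width, ctx.end_pc - ctx.pc)
        fetched = [(ctx.thread_id, ctx.pc + i) for i in range(count)]
        ctx.pc += count  # only the selected thread's program counter advances
        return fetched


if __name__ == "__main__":
    unit = FetchUnit([ThreadContext(0, 0, 5), ThreadContext(1, 100, 102)], fetch_width=4)
    for cycle in range(4):
        print(cycle, unit.fetch())
```

Tagging each fetched instruction with its thread ID is what later makes per-thread retirement and queue flushing possible, since the back end can tell which thread an instruction belongs to.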
Pipeline Changes
A larger register file causes a corresponding increase in register access time. This is addressed by allocating an extra cycle to register operations:
- First cycle: register values are moved closer to their final destination
- Second cycle: instructions and their associated values are issued for execution
Advantages
- Increased throughput
- Increased power efficiency
- Architecture changes minimal and simple to implement
- Performance gains will increase with improvements in technology
Disadvantages
Potential performance bottlenecks:
- Fetching: fetch bandwidth becomes a limiting factor as the number of threads grows
- Memory: memory throughput does not increase with multithreading, so more threads compete for the same bandwidth
- Registers: a larger register file is slower to access, and unused registers could contribute to slowing the processor down
Performance gains depend on a program's ability to exploit SMT, meaning developers have to build with SMT in mind. If SMT proves to be a detriment to a program's performance, developers will have to add a way to disable it.
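One way a program might opt out of sharing a core, without disabling SMT system-wide, is to restrict itself to one logical CPU per physical core. The sketch below assumes Linux, where each logical CPU exposes a `thread_siblings_list` file in sysfs and Python provides `os.sched_setaffinity`; the helper names are hypothetical, and this is an illustration of the idea rather than a recommended implementation.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,4' or '0-3,8' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Given {cpu: sibling-list text}, keep the lowest-numbered logical CPU of each core."""
    return sorted({parse_cpu_list(text)[0] for text in sibling_lists.values()})


def read_sibling_lists():
    """Read each logical CPU's SMT siblings from sysfs (Linux only)."""
    lists = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu = int(path.split("/")[5][3:])  # ".../cpu12/..." -> 12
        with open(path) as handle:
            lists[cpu] = handle.read()
    return lists


def pin_to_one_thread_per_core():
    """Restrict this process to one logical CPU per core; return the CPUs chosen."""
    target = set(one_cpu_per_core(read_sibling_lists())) & os.sched_getaffinity(0)
    if target:
        os.sched_setaffinity(0, target)
    return target
```

On a hypothetical two-core machine whose sibling lists are `"0,2"` and `"1,3"`, `one_cpu_per_core` selects CPUs 0 and 1. Other processes can still be scheduled on the sibling CPUs, so this reduces rather than eliminates resource sharing.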