1
Simultaneous Multithreading CMPE 511 BOĞAZİÇİ UNIVERSITY
2
AGENDA
INTRODUCTION: Motivation; Types of Parallelism; Vertical and Horizontal Wasted Slots; Superscalar Processors
Multithreading
Simultaneous Multithreading: The Idea; SMT Model; Issues: What to Fetch and What to Issue?; Caching
Performance Analysis: Simulation Results
Comparison
Drawbacks
Commercial Examples: IBM POWER5
Future Tendencies
3
INTRODUCTION: Motivation
Microprocessor design optimization has several focus areas:
1. Memory latency: increased processor speeds make memory appear further away, so longer stalls are possible.
2. Branch processing: a mispredict grows more costly as pipeline depth increases, causing stalls and wasted power; predication drives increased power and larger chip area.
3. Execution unit utilization: 20-25% execution unit utilization is common.
SMT addresses these areas!
4
INTRODUCTION: Motivation
Memory subsystem improvement or increased system integration alone is not sufficient for significant performance improvement.
Solution: increase parallelism in all its available forms.
Combine the multiple-instructions-per-cycle feature of modern superscalar processors with the latency-hiding ability of multithreaded architectures.
5
INTRODUCTION: Types of Parallelism
Bit-level: wider processor datapaths (8, 16, 32, 64, ...)
Word-level (SIMD): vector processors; multimedia instruction sets (Intel's MMX and SSE, Sun's VIS, etc.)
Instruction-level: pipelining; superscalar; VLIW and EPIC
Task and application levels: explicit parallel programming; multiple threads; multiple applications
6
INTRODUCTION: Vertical & Horizontal Wasted Slots
Vertical waste is introduced when the processor issues no instructions in a cycle.
Horizontal waste is introduced when not all issue slots can be filled in a cycle.
61% of the wasted slots are vertical waste.
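To make the two kinds of waste concrete, here is a minimal sketch (not from the slides) that classifies empty issue slots in a hypothetical per-cycle issue trace; the trace values are made up for illustration.

```python
# Classify empty issue slots: vertical waste (whole cycle idle) vs.
# horizontal waste (cycle partly filled). Illustrative sketch only.
ISSUE_WIDTH = 8

def waste(issued_per_cycle, width=ISSUE_WIDTH):
    """Return (vertical, horizontal) wasted slots for an issue trace."""
    vertical = sum(width for n in issued_per_cycle if n == 0)
    horizontal = sum(width - n for n in issued_per_cycle if 0 < n < width)
    return vertical, horizontal

trace = [0, 3, 8, 0, 5, 0, 2]   # hypothetical instructions issued per cycle
v, h = waste(trace)
print(v, h)                      # vertical waste dominates, as on the slide
```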
7
INTRODUCTION: Superscalar
Issues multiple instructions in each cycle, typically 4.
Several functional units of the same type, e.g. ALUs.
A dispatcher reads instructions and decides which can run in parallel.
Limited by instruction dependencies and long-latency operations.
Exhibits both horizontal and vertical waste.
Low utilization even on higher-issue machines: around 20% on an 8-issue machine.
8
INTRODUCTION: Superscalar Many slots in the execution core are unused.
9
MULTITHREADING
The processor is extended with the concept of a thread, allowing the scheduler to choose instructions from one thread or another at each clock.
Two types of thread scheduling: coarse-grain multithreading and fine-grain multithreading.
SMT uses both types of multithreading.
10
MULTITHREADING
11
What does a processor need for multithreading?
1. The processor must be aware of several independent states, one per thread: program counter; register file (and flags); memory.
2. Either multiple copies of these resources in the processor, or a fast way to switch across states.
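The replicated per-thread state can be sketched as a small data structure; the field names below are illustrative, not taken from any real ISA.

```python
# Sketch of the per-thread state a multithreaded core must replicate.
# Memory is shared across threads via virtual addressing, so it is not
# part of the per-thread context here.
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0                                           # program counter
    regs: list = field(default_factory=lambda: [0] * 32)  # register file
    flags: int = 0                                        # condition flags

# One context per hardware thread.
contexts = [ThreadContext() for _ in range(4)]
contexts[1].pc = 0x400
print(len(contexts), hex(contexts[1].pc))
```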
12
MULTITHREADING: Coarse-Grain Multithreading
Switch between threads only on costly stalls.
This form of multithreading only hides long-latency events.
Easy to implement, but the switching granularity is large.
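A toy model of switch-on-stall scheduling (illustrative only, with a made-up stall trace): the core stays on one thread and changes contexts only when that thread hits a long-latency event.

```python
# Coarse-grain multithreading sketch: switch threads only on costly stalls.
def coarse_grain_schedule(stalls, n_threads=2):
    """stalls[t][c] is True when thread t stalls in cycle c.
    Returns which thread occupies the core each cycle."""
    schedule, current = [], 0
    for cycle in range(len(stalls[0])):
        if stalls[current][cycle]:           # long-latency event: switch
            current = (current + 1) % n_threads
        schedule.append(current)
    return schedule

# Thread 0 misses in cycle 2; the core runs thread 1 from then on.
print(coarse_grain_schedule([[False, False, True, False],
                             [False, False, False, False]]))
```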
13
MULTITHREADING: Coarse-Grain
14
MULTITHREADING: Fine-Grain Multithreading
Context-switch between threads on every clock cycle.
Occupancy of the execution core is now much higher.
Hides both long- and short-latency events.
Vertical waste is eliminated, but horizontal waste is not: if a thread has few or no operations to execute, issue slots will still be wasted.
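The fine-grain policy reduces to a round-robin rotation across threads, one thread per cycle; this sketch assumes every thread is always ready:

```python
# Fine-grain multithreading sketch: a different thread owns each cycle,
# round-robin. No cycle is entirely idle while any thread is ready
# (vertical waste removed); waste within a cycle remains (horizontal).
def fine_grain_schedule(n_threads, n_cycles):
    return [cycle % n_threads for cycle in range(n_cycles)]

print(fine_grain_schedule(4, 8))
```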
15
MULTITHREADING: Fine-Grain
16
Simultaneous Multithreading: Idea
Combine superscalar and multithreading such that:
1. Multiple instructions issue per cycle (superscalar)
2. Hardware state is kept for several programs/threads (multithreading)
So: issue multiple instructions from multiple threads in each cycle.
17
Simultaneous Multithreading: Idea
18
Simultaneous Multithreading: Model
Extend, replicate and redesign some units of a superscalar to achieve multithreading.
Resources replicated:
State for hardware contexts (registers, PCs)
Per-thread mechanisms for pipeline flushing and subroutine returns
Per-thread identifiers for the branch target buffer and translation lookaside buffer
19
Simultaneous Multithreading: Model
Resources redesigned:
Instruction fetch unit
Processor pipeline
Instruction scheduling requires no additional hardware: register renaming works the same as in a superscalar.
20
Simultaneous Multithreading: Model SuperScalar Architecture
21
Simultaneous Multithreading: Model Block Diagram
22
Simultaneous Multithreading: Model — Instruction Fetch Unit
Takes advantage of inter-thread competition by partitioning bandwidth and fetching the threads that give maximum local benefit.
2.8 fetching: fetch 1 instruction per logical processor for 2 threads; decode 1 thread until a branch or the end of the cache line, then jump to the other.
ICOUNT feedback: highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages; only a small hardware addition is needed to track queue lengths.
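The ICOUNT heuristic can be sketched in a few lines; the in-flight counts below are hypothetical and the function name is mine, not from the SMT papers.

```python
# ICOUNT fetch policy sketch: fetch from the thread holding the fewest
# instructions in the decode/rename/queue stages, i.e. the thread least
# likely to clog the front end.
def icount_pick(in_flight):
    """in_flight[t] = front-end instruction count for thread t.
    Returns the thread id with the smallest count (ties: lowest id)."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

print(icount_pick([12, 3, 7, 9]))  # thread 1 is least clogged
```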
23
Simultaneous Multithreading: Model — Register File
Each thread has 32 architectural registers.
Register file size: 32 × #threads + rename registers.
So a larger register file means a longer access time.
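The slide's sizing formula is simple arithmetic; the example counts below (8 threads, 100 rename registers) are assumptions for illustration, not figures from the slides.

```python
# Register file sizing per the slide: 32 architectural registers per
# thread plus a shared pool of rename registers.
def physical_registers(n_threads, rename_regs, arch_regs=32):
    return arch_regs * n_threads + rename_regs

# e.g. 8 threads and 100 rename registers (both assumed values)
print(physical_registers(8, 100))
```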
24
Simultaneous Multithreading: Model Pipeline Format Superscalar SMT
25
Simultaneous Multithreading: Model — Pipeline Format
To avoid an increase in clock cycle time, the SMT pipeline is extended to allow 2-cycle register reads and writes.
These 2-cycle reads/writes increase the branch misprediction penalty.
26
Simultaneous Multithreading: Where to Fetch
Static solutions:
Round-robin: each cycle, 8 instructions from 1 thread; or 4 instructions from each of 2 threads; or 2 from each of 4; ...
Each cycle, 8 instructions from 2 threads: forward as many as possible from thread 1, then when a long-latency instruction appears in thread 1, pick the rest from thread 2.
Dynamic solutions (check the execution queues!):
Favour threads with the fewest in-flight branches
Favour threads with the fewest outstanding misses
Favour threads with the fewest in-flight instructions
Favour threads with instructions far from the queue head
27
Simultaneous Multithreading: What to Issue
Not exactly the same as in superscalars.
In a superscalar, oldest is best (least speculation, more dependents waiting, etc.).
In SMT it is not so clear: branch-speculation level and optimism (cache-hit speculation) vary across threads.
Possible selection strategies: oldest first; cache-hit-speculated last; branch-speculated last; branches first; ...
Important result: the choice doesn't matter too much!
28
Simultaneous Multithreading: Compiler Optimizations
Should try to minimize cache interference.
Latency-hiding techniques like speculation should be enhanced.
Sharing-optimization techniques from multiprocessors change: data sharing is now good.
29
Simultaneous Multithreading: Caching
The same cache is shared among the threads.
Performance degradation is possible due to cache sharing, including cache thrashing.
30
PERFORMANCE ANALYSIS
Four models are compared; the base machine has 10 functional units and is 8-issue.
1. Fine-Grain Multithreading
2. SM:Full Simultaneous Issue: eight threads compete for each of the issue slots each cycle.
3. SM:Single Issue, SM:Dual Issue, SM:Four Issue: limit the number of instructions each thread can issue per cycle. E.g. if each thread can issue a maximum of 2 instructions per cycle, a minimum of 4 threads is required to fill the 8 issue slots in one cycle.
4. SM:Limited Connection: each hardware context is directly connected to exactly one functional unit of each type.
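The per-thread issue caps can be sketched as a toy fill model; the ready-instruction counts below are invented for illustration, and this greedy fill is a simplification of the real issue logic.

```python
# Toy model of the capped issue schemes: each thread may issue at most
# `cap` instructions per cycle, and slots are filled greedily up to the
# machine's issue width.
def issue_cycle(ready, cap, width=8):
    """ready[t] = instructions thread t could issue this cycle.
    Returns the total number of instructions issued."""
    issued = 0
    for r in ready:
        take = min(r, cap, width - issued)
        issued += take
        if issued == width:
            break
    return issued

ready = [6, 1, 4, 0, 3, 2, 5, 1]     # hypothetical per-thread readiness
print(issue_cycle(ready, cap=1))      # SM:Single Issue
print(issue_cycle(ready, cap=4))      # SM:Four Issue
print(issue_cycle(ready, cap=8))      # SM:Full Simultaneous Issue
```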
31
PERFORMANCE ANALYSIS
32
PERFORMANCE ANALYSIS: H/W COMPLEXITY
33
COMPARISON: SMT vs. Multiprocessing
Multiprocessing statically assigns functional units to threads.
SMT allows threads to expand into whatever resources are available.
34
COMPARISON
35
DRAWBACKS
Two main drawbacks:
1. Single-thread performance decreases due to the architectural constraints.
2. Additional contexts increase power consumption.
36
Commercial Examples
Compaq Alpha 21464 (EV8): 4T SMT; project killed June 2001.
Intel Pentium 4 (Xeon): 2T SMT; availability in 2002 (already in silicon before, but not enabled); 10-30% gains expected; also called Hyper-Threading.
Sun Ultra IV: 2-core CMP, 2T SMT.
IBM POWER5: dual processor core; 8-way superscalar; simultaneous multithreaded (SMT) core with up to 2 virtual processors per real processor; 24% area growth per core for SMT.
37
Commercial Examples: IBM POWER5
38
SMT added to the superscalar micro-architecture:
A second program counter (PC) is added to share the I-fetch bandwidth.
The GPR/FPR rename mapper is expanded to map a second set of registers (a high-order address bit indicates the thread).
Completion logic is replicated to track two threads.
A thread bit is added to most address/tag buses.
39
Commercial Examples: IBM POWER5
40
Includes:
1. Thread priority mechanism: 8 levels, for power efficiency.
2. Dynamic thread switching: used if no task is ready for the second thread to run; allocates all machine resources to one thread; initiated by software.
41
Commercial Examples: IBM POWER5 Dormant thread wakes up on: 1. External interrupt 2. Decrementer interrupt 3. Special instruction from active thread
42
Future Tendencies
Simultaneous and Redundantly Threaded processors (SRT): increase reliability with fault detection and correction by running multiple copies of the same program simultaneously.
Software pre-execution in SMT: in some cases the data address is extremely hard to predict and prefetching is useless; use an idle SMT thread for pre-execution, a complete software solution.
Speculation: more techniques on speculation, e.g. Speculative Data-Driven Multithreading, Threaded Multiple-Path Execution, Simultaneous Subordinate Microthreading, and Thread-Level Speculation.
43
REFERENCES
"Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy, ISCA '95.
"Simultaneous Multithreading: Present Developments and Future Directions" by Miquel Peric, June 2003.
"Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor" by IBM, Aug 2004.
"Simultaneous Multithreading: A Platform for Next-Generation Processors" by Eggers, Emer, Levy, Lo, Stamm and Tullsen, IEEE Micro, October 1997.
44
Q&A THANKS!