1
Simultaneous Multithreading CMPE 511 BOĞAZİÇİ UNIVERSITY
2
AGENDA
INTRODUCTION: Motivation; Types of Parallelism; Vertical and Horizontal Wasted Slots; Superscalar Processors
Multithreading
Simultaneous Multithreading: The Idea; SMT Model; Issues: What to Fetch and What to Issue?; Caching
Performance Analysis: Simulation Results
Comparison
Drawbacks
Commercial Examples: IBM POWER5
Future Tendencies
3
INTRODUCTION: Motivation
Microprocessor design optimization has several focus areas:
1. Memory latency: increased processor speeds make memory appear further away, so longer stalls are possible.
2. Branch processing: a mispredict grows more costly as pipeline depth increases, causing stalls and wasted power; predication drives increased power and larger chip area.
3. Execution unit utilization: 20-25% execution unit utilization is common.
SMT addresses these areas!
4
INTRODUCTION: Motivation
Memory subsystem improvement or increased system integration alone is not sufficient for significant performance improvement.
Solution: increase parallelism in all its available forms.
Combine the multiple-instructions-per-cycle feature of modern superscalar processors with the latency-hiding ability of multithreaded architectures.
5
INTRODUCTION: Types of Parallelism
Bit-level: wider processor datapaths (8, 16, 32, 64, ...)
Word-level (SIMD): vector processors; multimedia instruction sets (Intel's MMX and SSE, Sun's VIS, etc.)
Instruction-level: pipelining; superscalar; VLIW and EPIC
Task and application levels: explicit parallel programming; multiple threads; multiple applications
6
INTRODUCTION: Vertical & Horizontal Wasted Slots
Vertical waste is introduced when the processor issues no instructions in a cycle.
Horizontal waste is introduced when not all issue slots can be filled in a cycle.
61% of the wasted slots are vertical waste.
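To make the two kinds of waste concrete, here is a minimal sketch (not from the slides) that classifies empty issue slots in a hypothetical per-cycle issue trace; the trace values are made up for illustration.

```python
# Classify empty issue slots: vertical waste (whole cycle idle) vs.
# horizontal waste (cycle partly filled). Illustrative sketch only.
ISSUE_WIDTH = 8

def waste(issued_per_cycle, width=ISSUE_WIDTH):
    """Return (vertical, horizontal) wasted slots for an issue trace."""
    vertical = sum(width for n in issued_per_cycle if n == 0)
    horizontal = sum(width - n for n in issued_per_cycle if 0 < n < width)
    return vertical, horizontal

trace = [0, 3, 8, 0, 5, 0, 2]   # hypothetical instructions issued per cycle
v, h = waste(trace)
print(v, h)                      # vertical waste dominates, as on the slide
```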
7
INTRODUCTION: Superscalar
Issues multiple instructions in each cycle, typically 4.
Several functional units of the same type, e.g. ALUs.
A dispatcher reads instructions and decides which can run in parallel.
Limited by instruction dependencies and long-latency operations.
Exhibits both horizontal and vertical waste.
Low utilization even on higher-issue machines: around 20% on an 8-issue machine.
8
INTRODUCTION: Superscalar Many slots in the execution core are unused.
9
MULTITHREADING
The processor is extended with the concept of a thread, allowing the scheduler to choose instructions from one thread or another at each clock.
Two types of thread scheduling: coarse-grain multithreading and fine-grain multithreading.
SMT uses both types of multithreading.
10
MULTITHREADING
11
What does a processor need for multithreading?
1. The processor must be aware of several independent states, one per thread: program counter; register file (and flags); memory.
2. Either multiple copies of these resources in the processor, or a fast way to switch across states.
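The replicated per-thread state can be sketched as a small data structure; the field names below are illustrative, not taken from any real ISA.

```python
# Sketch of the per-thread state a multithreaded core must replicate.
# Memory is shared across threads via virtual addressing, so it is not
# part of the per-thread context here.
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    pc: int = 0                                           # program counter
    regs: list = field(default_factory=lambda: [0] * 32)  # register file
    flags: int = 0                                        # condition flags

# One context per hardware thread.
contexts = [ThreadContext() for _ in range(4)]
contexts[1].pc = 0x400
print(len(contexts), hex(contexts[1].pc))
```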
12
MULTITHREADING: Coarse-Grain Multithreading
Switch between threads only on costly stalls.
This form of multithreading only hides long-latency events.
Easy to implement, but the switching granularity is large.
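A toy model of switch-on-stall scheduling (illustrative only, with a made-up stall trace): the core stays on one thread and changes contexts only when that thread hits a long-latency event.

```python
# Coarse-grain multithreading sketch: switch threads only on costly stalls.
def coarse_grain_schedule(stalls, n_threads=2):
    """stalls[t][c] is True when thread t stalls in cycle c.
    Returns which thread occupies the core each cycle."""
    schedule, current = [], 0
    for cycle in range(len(stalls[0])):
        if stalls[current][cycle]:           # long-latency event: switch
            current = (current + 1) % n_threads
        schedule.append(current)
    return schedule

# Thread 0 misses in cycle 2; the core runs thread 1 from then on.
print(coarse_grain_schedule([[False, False, True, False],
                             [False, False, False, False]]))
```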
13
MULTITHREADING: Coarse-Grain
14
MULTITHREADING: Fine-Grain Multithreading
Context-switch between threads on every clock cycle.
Occupancy of the execution core is now much higher.
Hides both long- and short-latency events.
Vertical waste is eliminated, but horizontal waste is not: if a thread has few or no operations to execute, issue slots will still be wasted.
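The fine-grain policy reduces to a round-robin rotation across threads, one thread per cycle; this sketch assumes every thread is always ready:

```python
# Fine-grain multithreading sketch: a different thread owns each cycle,
# round-robin. No cycle is entirely idle while any thread is ready
# (vertical waste removed); waste within a cycle remains (horizontal).
def fine_grain_schedule(n_threads, n_cycles):
    return [cycle % n_threads for cycle in range(n_cycles)]

print(fine_grain_schedule(4, 8))
```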
15
MULTITHREADING: Fine-Grain
16
Simultaneous Multithreading: Idea
Combine superscalar and multithreading such that:
1. Multiple instructions issue per cycle (superscalar)
2. Hardware state is kept for several programs/threads (multithreading)
So: issue multiple instructions from multiple threads in each cycle.
17
Simultaneous Multithreading: Idea
18
Simultaneous Multithreading: Model
Extend, replicate and redesign some units of a superscalar to achieve multithreading.
Resources replicated:
State for hardware contexts (registers, PCs)
Per-thread mechanisms for pipeline flushing and subroutine returns
Per-thread identifiers for the branch target buffer and translation lookaside buffer
19
Simultaneous Multithreading: Model
Resources redesigned:
Instruction fetch unit
Processor pipeline
Instruction scheduling requires no additional hardware: register renaming works the same as in a superscalar.
20
Simultaneous Multithreading: Model SuperScalar Architecture
21
Simultaneous Multithreading: Model Block Diagram
22
Simultaneous Multithreading: Model — Instruction Fetch Unit
Takes advantage of inter-thread competition by partitioning bandwidth and fetching the threads that give maximum local benefit.
2.8 fetching: fetch 1 instruction per logical processor for 2 threads; decode 1 thread until a branch or the end of the cache line, then jump to the other.
ICOUNT feedback: highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages; only a small hardware addition is needed to track queue lengths.
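The ICOUNT heuristic can be sketched in a few lines; the in-flight counts below are hypothetical and the function name is mine, not from the SMT papers.

```python
# ICOUNT fetch policy sketch: fetch from the thread holding the fewest
# instructions in the decode/rename/queue stages, i.e. the thread least
# likely to clog the front end.
def icount_pick(in_flight):
    """in_flight[t] = front-end instruction count for thread t.
    Returns the thread id with the smallest count (ties: lowest id)."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

print(icount_pick([12, 3, 7, 9]))  # thread 1 is least clogged
```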
23
Simultaneous Multithreading: Model — Register File
Each thread has 32 architectural registers.
Register file size: 32 × #threads + rename registers.
So a larger register file means a longer access time.
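The slide's sizing formula is simple arithmetic; the example counts below (8 threads, 100 rename registers) are assumptions for illustration, not figures from the slides.

```python
# Register file sizing per the slide: 32 architectural registers per
# thread plus a shared pool of rename registers.
def physical_registers(n_threads, rename_regs, arch_regs=32):
    return arch_regs * n_threads + rename_regs

# e.g. 8 threads and 100 rename registers (both assumed values)
print(physical_registers(8, 100))
```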
24
Simultaneous Multithreading: Model Pipeline Format Superscalar SMT
25
Simultaneous Multithreading: Model — Pipeline Format
To avoid an increase in clock cycle time, the SMT pipeline is extended to allow 2-cycle register reads and writes.
These 2-cycle reads/writes increase the branch misprediction penalty.
26
Simultaneous Multithreading: Where to Fetch
Static solutions:
Round-robin: each cycle, 8 instructions from 1 thread; or 4 instructions from each of 2 threads; or 2 from each of 4; ...
Each cycle, 8 instructions from 2 threads: forward as many as possible from thread 1, then when a long-latency instruction appears in thread 1, pick the rest from thread 2.
Dynamic solutions (check the execution queues!):
Favour threads with the fewest in-flight branches
Favour threads with the fewest outstanding misses
Favour threads with the fewest in-flight instructions
Favour threads with instructions far from the queue head
27
Simultaneous Multithreading: What to Issue
Not exactly the same as in superscalars.
In a superscalar, oldest is best (least speculation, more dependents waiting, etc.).
In SMT it is not so clear: branch-speculation level and optimism (cache-hit speculation) vary across threads.
Possible selection strategies: oldest first; cache-hit-speculated last; branch-speculated last; branches first; ...
Important result: the choice doesn't matter too much!
28
Simultaneous Multithreading: Compiler Optimizations
Should try to minimize cache interference.
Latency-hiding techniques like speculation should be enhanced.
Sharing-optimization techniques from multiprocessors change: data sharing is now good.
29
Simultaneous Multithreading: Caching
The same cache is shared among the threads.
Performance degradation is possible due to cache sharing, including cache thrashing.
30
PERFORMANCE ANALYSIS
Four models are compared; the base machine has 10 functional units and is 8-issue.
1. Fine-Grain Multithreading
2. SM:Full Simultaneous Issue: eight threads compete for each of the issue slots each cycle.
3. SM:Single Issue, SM:Dual Issue, SM:Four Issue: limit the number of instructions each thread can issue per cycle. E.g. if each thread can issue a maximum of 2 instructions per cycle, a minimum of 4 threads is required to fill the 8 issue slots in one cycle.
4. SM:Limited Connection: each hardware context is directly connected to exactly one functional unit of each type.
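The per-thread issue caps can be sketched as a toy fill model; the ready-instruction counts below are invented for illustration, and this greedy fill is a simplification of the real issue logic.

```python
# Toy model of the capped issue schemes: each thread may issue at most
# `cap` instructions per cycle, and slots are filled greedily up to the
# machine's issue width.
def issue_cycle(ready, cap, width=8):
    """ready[t] = instructions thread t could issue this cycle.
    Returns the total number of instructions issued."""
    issued = 0
    for r in ready:
        take = min(r, cap, width - issued)
        issued += take
        if issued == width:
            break
    return issued

ready = [6, 1, 4, 0, 3, 2, 5, 1]     # hypothetical per-thread readiness
print(issue_cycle(ready, cap=1))      # SM:Single Issue
print(issue_cycle(ready, cap=4))      # SM:Four Issue
print(issue_cycle(ready, cap=8))      # SM:Full Simultaneous Issue
```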
31
PERFORMANCE ANALYSIS
32
PERFORMANCE ANALYSIS: H/W COMPLEXITY
33
COMPARISON: SMT vs. Multiprocessing
Multiprocessing statically assigns functional units to threads.
SMT allows threads to expand into whatever resources are available.
34
COMPARISON
35
DRAWBACKS
Two main drawbacks:
1. Single-thread performance decreases due to the architectural constraints.
2. Additional contexts increase power consumption.
36
Commercial Examples
Compaq Alpha 21464 (EV8): 4T SMT; project killed June 2001.
Intel Pentium 4 (Xeon): 2T SMT; availability in 2002 (already in silicon before, but not enabled); 10-30% gains expected; also called Hyper-Threading.
Sun Ultra IV: 2-core CMP, 2T SMT.
IBM POWER5: dual processor core; 8-way superscalar; simultaneous multithreaded (SMT) core with up to 2 virtual processors per real processor; 24% area growth per core for SMT.
37
Commercial Examples: IBM POWER5
38
SMT added to the superscalar micro-architecture:
A second program counter (PC) is added to share the I-fetch bandwidth.
The GPR/FPR rename mapper is expanded to map a second set of registers (a high-order address bit indicates the thread).
Completion logic is replicated to track two threads.
A thread bit is added to most address/tag buses.
39
Commercial Examples: IBM POWER5
40
Includes:
1. Thread priority mechanism: 8 levels, for power efficiency.
2. Dynamic thread switching: used if no task is ready for the second thread to run; allocates all machine resources to one thread; initiated by software.
41
Commercial Examples: IBM POWER5 Dormant thread wakes up on: 1. External interrupt 2. Decrementer interrupt 3. Special instruction from active thread
42
Future Tendencies
Simultaneous and Redundantly Threaded processors (SRT): increase reliability with fault detection and correction by running multiple copies of the same program simultaneously.
Software pre-execution in SMT: in some cases the data address is extremely hard to predict and prefetching is useless; use an idle SMT thread for pre-execution, a complete software solution.
Speculation: more techniques on speculation, e.g. Speculative Data-Driven Multithreading, Threaded Multiple-Path Execution, Simultaneous Subordinate Microthreading, and Thread-Level Speculation.
43
REFERENCES
"Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy, ISCA '95.
"Simultaneous Multithreading: Present Developments and Future Directions" by Miquel Peric, June 2003.
"Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor" by IBM, Aug 2004.
"Simultaneous Multithreading: A Platform for Next-Generation Processors" by Eggers, Emer, Levy, Lo, Stamm and Tullsen, IEEE Micro, October 1997.
44
Q&A THANKS!