CS 7810 Lecture 16
Simultaneous Multithreading: Maximizing On-Chip Parallelism
D.M. Tullsen, S.J. Eggers, H.M. Levy
Proceedings of ISCA-22, June 1995
Processor Under-Utilization
- Wide gap between average processor utilization and peak processor utilization
- Caused by dependences, long-latency instructions, and branch mispredicts
- Results in many idle cycles for many structures
Superscalar Utilization
[Figure: issue slots (resources, e.g. FUs) over time for Thread-1, showing vertical and horizontal waste]
- Suffers from horizontal waste (can't find enough work in a cycle) and vertical waste (because of dependences, there is nothing to do for many cycles)
- Utilization = 19%; vertical:horizontal waste = 61:39
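The waste taxonomy on this slide can be made concrete with a small calculation over a per-cycle issue trace. The trace and issue width below are made up for illustration, not the paper's measured data:

```python
# Classify wasted issue slots in a superscalar: vertical waste covers
# cycles where nothing issues at all; horizontal waste is the unused
# slots in partially filled cycles. Trace and width are hypothetical.
ISSUE_WIDTH = 8
trace = [3, 0, 0, 5, 1, 0, 8, 0, 2, 0]  # instructions issued per cycle

total_slots = ISSUE_WIDTH * len(trace)
vertical = sum(ISSUE_WIDTH for n in trace if n == 0)
horizontal = sum(ISSUE_WIDTH - n for n in trace if n > 0)
utilization = sum(trace) / total_slots  # 19/80 here, i.e. under 25%

print(vertical, horizontal, sum(trace), total_slots)  # -> 40 21 19 80
```

Every slot is exactly one of: used, vertically wasted, or horizontally wasted, which is why the three counts sum to the total.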
Chip Multiprocessors
[Figure: resources (e.g. FUs) over time for Thread-1 and Thread-2 on separate cores]
- Single-thread performance goes down
- Horizontal waste is reduced
Fine-Grain Multithreading
[Figure: resources (e.g. FUs) over time, Thread-1 and Thread-2 interleaved cycle by cycle]
- Low-cost context switch at a fine grain
- Reduces vertical waste
Simultaneous Multithreading
[Figure: resources (e.g. FUs) over time, Thread-1 and Thread-2 sharing issue slots within a cycle]
- Reduces both vertical and horizontal waste
Pipeline Structure
[Figure: four front ends (I-cache, bpred, rename, ROB) feeding a shared execution engine (regs, IQ, FUs, D-cache)]
- Front end: private or shared; execution engine: shared
- What about the RAS and LSQ?
Chip Multi-Processor
[Figure: four private front ends (I-cache, bpred, rename, ROB), each feeding its own private execution engine (regs, IQ, FUs, D-cache)]
Clustered SMT
[Figure: four front ends feeding a set of execution clusters]
Evaluated Models
- Fine-grained multithreading
- Unrestricted SMT
- Restricted SMT:
  - X-issue: a thread can only issue up to X instructions in a cycle
  - Limited connection: each thread is tied to a fixed FU
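The X-issue restriction can be sketched as a toy issue-cycle model (the function and variable names are mine, not the paper's): each cycle, every thread may issue at most X of its ready instructions, and the machine's total issue width caps the sum.

```python
def issue_cycle(ready_per_thread, x_limit, width):
    """One issue cycle under 'restricted SMT, X-issue' (toy model)."""
    issued = 0
    for ready in ready_per_thread:
        # Per-thread cap (x_limit), then whole-machine cap (width).
        take = min(ready, x_limit, width - issued)
        issued += take
        if issued == width:
            break
    return issued

# Four threads with plenty of ready work still fill an 8-wide machine
# when X = 2; with less ready work, slots go unused (horizontal waste).
print(issue_cycle([5, 5, 5, 5], x_limit=2, width=8))  # -> 8
print(issue_cycle([5, 1, 0, 2], x_limit=2, width=8))  # -> 5
```

Setting x_limit equal to the machine width recovers unrestricted SMT in this sketch.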
Results
- SMT nearly eliminates horizontal waste
- In spite of priorities, single-thread performance degrades (cache contention)
- Not much difference between private and shared caches; however, with few threads, the private caches go under-utilized
Comparison of Models
CMP vs. SMT
Exploiting Choice: Instruction Fetch and Issue on an Implementable SMT Processor
D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm
Proceedings of ISCA-23, June 1996
New Bottlenecks
- Instruction fetch has a strong influence on total throughput
- If the execution engine is executing at top speed, it is often hungry for new instructions
- Some threads are more likely to have ready instructions than others, so thread selection becomes important
SMT Processor
- Multiple PCs
- Multiple renames and ROBs
- Multiple RASes
- More registers
SMT Overheads
- Large register file: need at least 256 physical registers to support eight threads
  - increases cycle time / pipeline depth
  - increases mispredict penalty
  - increases bypass complexity
  - increases register lifetime
- Results in a 2% performance loss
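The 256-register floor follows from architectural state alone. A back-of-the-envelope check (the 100 rename registers below are an illustrative assumption, not the paper's exact figure):

```python
threads = 8
arch_regs = 32       # architected registers per thread
rename_regs = 100    # extra registers for in-flight renaming (assumed value)

# Just holding every thread's architected state needs threads * arch_regs
# physical registers; renaming capacity comes on top of that.
arch_state = threads * arch_regs       # the 256-register floor on the slide
phys_regs = arch_state + rename_regs
print(arch_state, phys_regs)  # -> 256 356
```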
Base Design
- Front end is fine-grain multithreaded; the rest is SMT
- Bottlenecks:
  - low fetch rate (4.2 instrs/cycle)
  - IQ is often full, but only half the issue bandwidth is being used
Fetch Efficiency
- Base case uses round-robin, RR.1.8 (one thread supplies up to eight instructions per cycle)
- RR.2.4: fetches four instructions each from two threads
  - requires a banked I-cache organization and additional multiplexing logic
  - increases the chances of finding eight instructions without a taken branch
  - yields instructions in spite of an I-cache miss (the other thread can still fetch)
- RR.2.8: extends RR.2.4 by reading out a larger cache line
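The RR.N.K family can be sketched with a toy model over per-thread instruction streams (all names here are mine): each cycle, the next N threads in round-robin order each supply up to K instructions.

```python
def rr_fetch(thread_streams, start, n_threads, k_per_thread):
    """One fetch cycle of RR.N.K: up to K instrs from each of N threads."""
    fetched = []
    t_count = len(thread_streams)
    for i in range(n_threads):
        t = (start + i) % t_count
        fetched.extend(thread_streams[t][:k_per_thread])
        del thread_streams[t][:k_per_thread]
    # Return fetched instructions and the next round-robin position.
    return fetched, (start + n_threads) % t_count

# RR.2.4 over four threads: 4 instructions from thread 0, 4 from thread 1.
streams = [[f"t{t}i{i}" for i in range(8)] for t in range(4)]
got, nxt = rr_fetch(streams, start=0, n_threads=2, k_per_thread=4)
print(got)  # -> ['t0i0', 't0i1', 't0i2', 't0i3', 't1i0', 't1i1', 't1i2', 't1i3']
print(nxt)  # -> 2
```

In this sketch the base case is simply n_threads=1, k_per_thread=8; the hardware costs on the slide (banking, muxing) are what make n_threads > 1 non-trivial in a real I-cache.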
Results
Fetch Effectiveness
- Are we picking the best instructions?
- IQ clog: instructions that sit in the issue queue for ages; does it make sense to fetch their dependents?
- Wrong-path instructions waste issue slots
- Ideally, we want useful instructions that have short issue-queue lifetimes
Fetch Effectiveness
- Useful instructions: throttle fetch when the branch misprediction probability is high, based on branch confidence, the number of unresolved in-flight branches (BRCOUNT), or the in-flight window size
- Short lifetimes: throttle fetch when a thread encounters a cache miss (MISSCOUNT); give priority to threads that have young instructions (IQPOSN)
ICOUNT
- Priority is based on each thread's number of unissued instructions
- Everyone gets a share of the issue queue: long-latency instructions will not dominate the IQ
- Threads that have a high issue rate will also have a high fetch rate
- In-flight windows are short and wrong-path instructions are minimized
- Increased fairness, and more ready instructions per cycle
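The selection step of ICOUNT.2.8 can be sketched as follows (helper names are mine): each cycle, fetch priority goes to the threads with the fewest unissued instructions in the decode/rename stages and issue queue.

```python
def icount_pick(unissued, n_pick=2):
    """Return the n_pick thread ids with the fewest unissued instructions."""
    return sorted(range(len(unissued)), key=lambda t: unissued[t])[:n_pick]

# Threads 1 (3 unissued) and 2 (7 unissued) win the fetch slots; thread 0,
# which is clogging the IQ with 12 unissued instructions, is passed over.
print(icount_pick([12, 3, 7, 9]))  # -> [1, 2]
```

The feedback loop is the point: a thread that issues quickly drains its count and keeps getting fetched, while a stalled thread's count grows and starves its own fetch, which is exactly how the IQ stays shared.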
Results
- Throughput has gone from 2.2 IPC (single-thread) to 3.9 (base SMT) to 5.4 (ICOUNT.2.8)
Reducing IQ-clog
- IQBUF: a buffer in front of the issue queue
- ITAG: pre-examine I-cache tags to detect misses and avoid wasting fetch bandwidth
- OPT_last and SPEC_last: lower the issue priority of speculative instructions
- These techniques entail overheads and yield only minor improvements
Bottleneck Analysis
- Not bottlenecks: issue bandwidth, issue-queue size, memory throughput
- Doubling fetch bandwidth improves throughput by 8%, so there is still room for improvement
- SMT is more tolerant of branch mispredicts: perfect prediction improves 1-thread throughput by 25% but 8-thread throughput by only 9%; removing speculation has a similarly modest effect
- The register file can be a huge bottleneck
IPC vs. Threads vs. Registers
Power and Energy
- Energy is heavily influenced by "work done" and by execution time
- Compared to a single-thread machine, SMT does not reduce "work done" but does reduce execution time, so energy goes down
- Same work in less time also means higher power!
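The slide's argument in numbers (all values below are made up to illustrate the relationship): model energy as a fixed dynamic part, proportional to work done, plus static power integrated over runtime.

```python
dynamic_energy = 100.0        # J, proportional to "work done" (same either way)
static_power = 5.0            # W, burned the whole time the machine is on
t_single, t_smt = 10.0, 6.0   # s; SMT finishes the same work sooner

e_single = dynamic_energy + static_power * t_single   # 150.0 J
e_smt = dynamic_energy + static_power * t_smt         # 130.0 J: less energy
p_single = e_single / t_single                        # 15.0 W
p_smt = e_smt / t_smt                                 # ~21.7 W: higher power
print(e_smt < e_single, p_smt > p_single)  # -> True True
```

Under this model the shorter runtime shaves off static energy, yet dividing nearly the same energy by a smaller time necessarily raises average power, which is the thermal-design concern the slide is flagging.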