
CS 7810 Lecture 16 Simultaneous Multithreading: Maximizing On-Chip Parallelism D.M. Tullsen, S.J. Eggers, H.M. Levy Proceedings of ISCA-22 June 1995.




1 CS 7810 Lecture 16 Simultaneous Multithreading: Maximizing On-Chip Parallelism D.M. Tullsen, S.J. Eggers, H.M. Levy Proceedings of ISCA-22 June 1995

2 Processor Under-Utilization Wide gap between average processor utilization and peak processor utilization Caused by dependences, long latency instrs, branch mispredicts Results in many idle cycles for many structures

3 Superscalar Utilization [figure: issue slots for Thread-1 plotted as Time × Resources (e.g. FUs), with V waste and H waste shaded] Suffers from horizontal waste (can’t find enough work in a cycle) and vertical waste (because of dependences, there is nothing to do for many cycles). Utilization = 19%; vertical:horizontal waste = 61:39
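The utilization and waste split on this slide can be computed directly from an issue-slot trace. A minimal sketch with a hypothetical 4-wide trace (the trace values are illustrative, not the paper's measurements):

```python
# Hypothetical 4-wide issue trace: each row is one cycle,
# 1 = issue slot used, 0 = slot idle. Not data from the paper.
trace = [
    [1, 1, 0, 0],  # partially used cycle -> horizontal waste
    [0, 0, 0, 0],  # fully idle cycle     -> vertical waste
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

total_slots = sum(len(row) for row in trace)
used = sum(sum(row) for row in trace)
# Vertical waste: every slot of a cycle in which nothing issued.
vertical = sum(len(row) for row in trace if sum(row) == 0)
# Horizontal waste: idle slots in cycles that did issue something.
horizontal = total_slots - used - vertical

print(f"utilization = {used / total_slots:.0%}")
print(f"vertical:horizontal waste = {vertical}:{horizontal}")
```

CMP attacks the horizontal term, fine-grain multithreading the vertical term, and SMT both, as the next three slides show.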

4 Chip Multiprocessors [figure: Thread-1 and Thread-2 each own a fixed partition of the resources (e.g. FUs) over time; V waste and H waste shaded] Single-thread performance goes down Horizontal waste reduces

5 Fine-Grain Multithreading [figure: Thread-1 and Thread-2 alternate across all resources (e.g. FUs) cycle by cycle; V waste and H waste shaded] Low-cost context-switch at a fine grain Reduces vertical waste

6 Simultaneous Multithreading [figure: Thread-1 and Thread-2 share issue slots within the same cycle across the resources (e.g. FUs)] Reduces vertical and horizontal waste

7 Pipeline Structure [diagram: four front-ends (I-Cache, Bpred, Rename, ROB) feeding one execution engine (Regs, IQ, FUs, D-Cache)] Front-end can be private or shared; the execution engine is shared. What about the RAS and LSQ?

8 Chip Multi-Processor [diagram: four private front-ends (I-Cache, Bpred, Rename, ROB), each feeding its own private execution engine (Regs, IQ, FUs, D-Cache)]

9 Clustered SMT [diagram: four front-ends feeding a set of execution clusters]

10 Evaluated Models Fine-Grained Multithreading Unrestricted SMT Restricted SMT  X-issue: A thread can only issue up to X instrs in a cycle  Limited connection: each thread is tied to a fixed FU

11 Results SMT nearly eliminates horizontal waste In spite of priorities, single-thread performance degrades (cache contention) Not much difference between private and shared caches – however, with few threads, the private caches go under-utilized

12 Comparison of Models

13 CMP vs. SMT

14 CS 7810 Lecture 16 Exploiting Choice: Instruction Fetch and Issue on an Implementable SMT Processor D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm Proceedings of ISCA-23 June 1996

15 New Bottlenecks Instruction fetch has a strong influence on total throughput  if the execution engine is executing at top speed, it is often hungry for new instrs  some threads are more likely to have ready instrs than others – selection becomes important

16 SMT Processor Multiple PCs Multiple Renames and ROBs Multiple RAS More registers

17 SMT Overheads Large register file – need at least 256 physical registers just to support eight threads  increases cycle time/pipeline depth  increases the mispredict penalty  increases bypass complexity  increases register lifetime Results in a 2% performance loss
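The 256-register floor is just architectural state. A sketch of the arithmetic, assuming 32 architectural integer registers per thread (Alpha-style); the renaming-pool size below is a hypothetical example, not the paper's exact configuration:

```python
# Register file sizing sketch. Assumptions: 32 architectural integer
# registers per thread (Alpha-style ISA); the renaming pool size is an
# illustrative guess, not a figure from the paper.
threads = 8
arch_regs_per_thread = 32
renaming_pool = 100  # extra physical registers for in-flight renames (assumed)

arch_state = threads * arch_regs_per_thread  # floor: must hold all threads' state
physical_regs = arch_state + renaming_pool

print(arch_state)     # 256 registers of architectural state alone
print(physical_regs)
```

A structure this large is what drives the cycle-time, bypass, and mispredict-penalty costs listed above.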

18 Base Design Front-end is fine-grain multithreaded, rest is SMT Bottlenecks:  Low fetch rate (4.2 instrs/cycle)  IQ is often full, but only half the issue bandwidth is being used

19 Fetch Efficiency Base case uses round-robin RR.1.8 (one thread, up to eight instrs per cycle). RR.2.4: fetches four instrs each from two threads  requires a banked I-cache organization  requires additional multiplexing logic Increases the chances of finding eight instrs without a taken branch Still yields instrs when one thread takes an I-cache miss RR.2.8: extends RR.2.4 by reading out a larger cache line
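The RR.2.4 partitioning can be sketched as a simple cycle-level model (illustrative only; the thread names and instruction labels below are made up, and real hardware does this with banked I-cache ports, not lists):

```python
from collections import deque

# Sketch of RR.2.4-style fetch: each cycle, visit the next 2 threads in
# round-robin order and fetch up to 4 instrs from each (8 total).
def rr_fetch(pending, order, n_threads=2, per_thread=4):
    fetched = []
    for _ in range(n_threads):
        tid = order[0]
        order.rotate(-1)  # advance the round-robin pointer
        take = min(per_thread, len(pending[tid]))
        fetched += [(tid, pending[tid].pop(0)) for _ in range(take)]
    return fetched

# Hypothetical pending instruction streams for two threads.
pending = {0: ["i0", "i1", "i2", "i3", "i4"], 1: ["j0", "j1"]}
order = deque([0, 1])
print(rr_fetch(pending, order))  # 4 instrs from thread 0, then thread 1's 2
```

If thread 1's fetch misses in the I-cache, thread 0 can still fill its half of the fetch bandwidth, which is the point the slide makes.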

20 Results

21 Fetch Effectiveness Are we picking the best instructions? IQ-clog: instrs that sit in the issue queue for ages; does it make sense to fetch their dependents? Wrong-path instructions waste issue slots Ideally, we want useful instructions that have short issue queue lifetimes

22 Fetch Effectiveness Useful instructions: throttle fetch if branch mpred probability is high  confidence, num-branches (BRCOUNT), in-flight window size Short lifetimes: throttle fetch if you encounter a cache miss (MISSCOUNT), give priority to threads that have young instrs (IQPOSN)

23 ICOUNT ICOUNT: priority is based on the number of unissued instrs  every thread gets a share of the issue queue Long-latency instructions will not dominate the IQ Threads that have a high issue rate will also have a high fetch rate In-flight windows are short and wrong-path instrs are minimized Increased fairness  more ready instrs per cycle
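The selection rule itself is tiny; a minimal sketch (illustrative model, not the paper's hardware, and the counts below are hypothetical):

```python
# ICOUNT thread selection sketch: give fetch priority each cycle to the
# threads with the fewest unissued instructions in the front of the
# pipeline (decode/rename/issue queue).
def icount_pick(unissued_counts, n=2):
    # unissued_counts: {thread_id: unissued instrs currently in flight}
    ranked = sorted(unissued_counts, key=lambda t: unissued_counts[t])
    return ranked[:n]  # the n threads fetched from this cycle (RR.2.x style)

counts = {0: 12, 1: 3, 2: 7, 3: 25}
print(icount_pick(counts))  # threads 1 and 2: fewest unissued instrs
```

A thread like 3, whose instructions sit unissued (e.g. behind a long-latency load), automatically loses fetch priority until its backlog drains, which is how ICOUNT prevents IQ-clog.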

24 Results Thruput has gone from 2.2 (single-thread) to 3.9 (base SMT) to 5.4 (ICOUNT.2.8)

25 Reducing IQ-clog IQBUF: a buffer before the issue queue ITAG: pre-examine the tags to detect I-cache misses and not waste fetch bandwidth OPT_last and SPEC_last: lower issue priority for speculative instrs These techniques entail overheads and result in minor improvements

26 Bottleneck Analysis The following are not bottlenecks: issue bandwidth, issue queue size, memory thruput Doubling fetch bandwidth improves thruput by 8% -- there is still room for improvement SMT is more tolerant of branch mpreds: perfect prediction improves 1-thread by 25% and 8-thread by 9% -- no speculation has a similar effect Register file can be a huge bottleneck

27 IPC vs. Threads vs. Registers

28 Power and Energy Energy is heavily influenced by “work done” and by execution time  compared to a single-thread machine, SMT does not reduce “work done”, but reduces execution time  reduced energy Same work, less time  higher power!
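The slide's claim is plain arithmetic from P = E / t; a sketch with hypothetical numbers (the joule and second values are made up for illustration):

```python
# Same work -> roughly the same energy; shorter runtime -> higher
# average power, since P = E / t. All numbers are hypothetical.
energy_j = 10.0   # joules of "work done", assumed equal for both runs
t_single = 5.0    # seconds, single-thread run
t_smt = 2.0       # seconds, SMT run (finishes sooner)

print(energy_j / t_single)  # 2.0 W average power, single-thread
print(energy_j / t_smt)     # 5.0 W average power, SMT: same E, less t
```

This is why SMT helps energy (battery life) while making peak power and cooling harder, not easier.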


