CS 7810 Lecture 16 – Simultaneous Multithreading: Maximizing On-Chip Parallelism. D.M. Tullsen, S.J. Eggers, H.M. Levy. Proceedings of ISCA-22, June 1995.

Presentation transcript:

CS 7810 Lecture 16
Simultaneous Multithreading: Maximizing On-Chip Parallelism
D.M. Tullsen, S.J. Eggers, H.M. Levy
Proceedings of ISCA-22, June 1995

Processor Under-Utilization
– Wide gap between average processor utilization and peak processor utilization
– Caused by dependences, long-latency instrs, branch mispredicts
– Results in many idle cycles for many structures

Superscalar Utilization
[Figure: issue slots over time across resources (e.g. FUs) for Thread-1, with vertical and horizontal waste marked]
– Suffers from horizontal waste (can't find enough work within a cycle) and vertical waste (because of dependences, there is nothing to do for many cycles)
– Utilization = 19%; vertical:horizontal waste = 61:39
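The waste decomposition above can be sketched numerically. This is a made-up issue schedule, not the paper's data: a cycle that issues nothing is vertical waste, and unused slots in a cycle that issues something are horizontal waste.

```python
# Hypothetical sketch of the vertical/horizontal waste decomposition.
# The schedule below is invented illustrative data.

ISSUE_WIDTH = 4
issued_per_cycle = [3, 0, 1, 0, 2, 0, 4, 0, 1, 0]  # instrs issued each cycle

total_slots = ISSUE_WIDTH * len(issued_per_cycle)
useful = sum(issued_per_cycle)
# Vertical waste: every slot of a cycle that issued zero instructions.
vertical = sum(ISSUE_WIDTH for n in issued_per_cycle if n == 0)
# Horizontal waste: leftover slots in cycles that issued at least one.
horizontal = sum(ISSUE_WIDTH - n for n in issued_per_cycle if n > 0)

assert useful + vertical + horizontal == total_slots
print(f"utilization = {useful / total_slots:.0%}")
print(f"vertical:horizontal waste = {vertical}:{horizontal}")
```

With the paper's measured workloads the same accounting gives the 19% utilization and 61:39 split quoted on the slide.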

Chip Multiprocessors
[Figure: issue slots over time for Thread-1 and Thread-2 on separate cores, with vertical and horizontal waste marked]
– Single-thread performance goes down
– Horizontal waste is reduced

Fine-Grain Multithreading
[Figure: issue slots over time with Thread-1 and Thread-2 interleaved cycle by cycle]
– Low-cost context switch at a fine grain
– Reduces vertical waste

Simultaneous Multithreading
[Figure: issue slots over time with Thread-1 and Thread-2 sharing slots within the same cycle]
– Reduces both vertical and horizontal waste

Pipeline Structure
[Figure: four front-ends (I-Cache, Bpred, Rename, ROB) feeding a shared execution engine (Regs, IQ, FUs, D-Cache)]
– Front-end (per thread) can be private or shared; the execution engine is shared
– What about the RAS and LSQ?

Chip Multi-Processor
[Figure: four private front-ends (I-Cache, Bpred, Rename, ROB), each feeding its own private execution engine (Regs, IQ, FUs, D-Cache)]
– Private front-end and private execution engine per core

Clustered SMT
[Figure: four front-ends feeding execution clusters]

Evaluated Models
– Fine-Grained Multithreading
– Unrestricted SMT
– Restricted SMT
   X-issue: a thread can only issue up to X instrs in a cycle
   Limited connection: each thread is tied to a fixed FU

Results
– SMT nearly eliminates horizontal waste
– In spite of priorities, single-thread performance degrades (cache contention)
– Not much difference between private and shared caches; however, with few threads, the private caches go under-utilized

Comparison of Models

CMP vs. SMT

CS 7810 Lecture 16
Exploiting Choice: Instruction Fetch and Issue on an Implementable SMT Processor
D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm
Proceedings of ISCA-23, June 1996

New Bottlenecks
– Instruction fetch has a strong influence on total throughput
   if the execution engine is executing at top speed, it is often hungry for new instrs
   some threads are more likely to have ready instrs than others – selection becomes important

SMT Processor
– Multiple PCs
– Multiple renames and ROBs
– Multiple RASs
– More registers

SMT Overheads
– Large register file – need at least 256 physical registers to support eight threads
   increases cycle time / pipeline depth
   increases mispredict penalty
   increases bypass complexity
   increases register lifetime
– Results in a 2% performance loss
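The 256-register floor is simple arithmetic: every hardware context needs its own architectural register state, so the physical file must hold at least threads × architectural registers, plus a pool for in-flight renamed results. The rename-pool size below is an assumed illustrative number, not from the paper.

```python
# Back-of-the-envelope for the register-file claim on this slide.

ARCH_REGS = 32      # architectural integer registers (Alpha-style ISA)
THREADS = 8
RENAME_REGS = 100   # assumed pool for in-flight results (illustrative)

min_physical = THREADS * ARCH_REGS
print(min_physical)                # 256 -- the floor quoted on the slide
print(min_physical + RENAME_REGS)  # total once a rename pool is added
```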

Base Design
– Front-end is fine-grain multithreaded, rest is SMT
– Bottlenecks:
   low fetch rate (4.2 instrs/cycle)
   IQ is often full, but only half the issue bandwidth is being used

Fetch Efficiency
– Base case uses round-robin fetch, RR.1.8
– RR.2.4: fetches four instrs each from two threads
   requires a banked I-cache organization
   requires additional multiplexing logic
– Increases the chances of finding eight instrs without a taken branch
– Still yields instrs even when one thread suffers an I-cache miss
– RR.2.8: extends RR.2.4 by reading out a larger cache line
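The RR.t.i naming above can be sketched as a tiny selection loop: round-robin over thread contexts, picking t threads per cycle, each allowed up to i fetch slots. The thread IDs and the helper are invented for illustration.

```python
# Sketch of round-robin fetch partitioning (RR.1.8, RR.2.4, ...).
# "RR.t.i" = fetch from t threads per cycle, up to i instrs each
# (RR.2.8 additionally caps the combined total at 8).
from itertools import cycle

def rr_fetch(thread_order, t, i):
    """Pick the next t threads round-robin; each may fetch up to i instrs."""
    return {next(thread_order): i for _ in range(t)}

threads = cycle([0, 1, 2, 3])     # four hardware contexts, in rotation
print(rr_fetch(threads, 1, 8))    # RR.1.8: one thread gets all 8 slots
print(rr_fetch(threads, 2, 4))    # RR.2.4: next two threads, 4 slots each
```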

Results

Fetch Effectiveness
– Are we picking the best instructions?
– IQ-clog: instrs that sit in the issue queue for ages; does it make sense to fetch their dependents?
– Wrong-path instructions waste issue slots
– Ideally, we want useful instructions that have short issue-queue lifetimes

Fetch Effectiveness
– Useful instructions: throttle fetch if branch misprediction probability is high
   confidence, number of branches (BRCOUNT), in-flight window size
– Short lifetimes: throttle fetch if you encounter a cache miss (MISSCOUNT); give priority to threads that have young instrs (IQPOSN)

ICOUNT
– ICOUNT: priority is based on the number of unissued instrs
   everyone gets a share of the issue queue
– Long-latency instructions will not dominate the IQ
– Threads with a high issue rate will also have a high fetch rate
– In-flight windows are short and wrong-path instrs are minimized
– Increased fairness  more ready instrs per cycle
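The heuristic above reduces to a small priority sort: each cycle, fetch from the threads with the fewest unissued in-flight instructions. This is a minimal sketch; the thread IDs and counts are made up.

```python
# Minimal sketch of the ICOUNT fetch-priority heuristic: rank threads
# by their count of unissued (in-flight) instructions, fewest first.

def icount_pick(unissued_counts, num_threads_to_fetch=2):
    """Return thread ids ordered by fewest unissued instructions."""
    ranked = sorted(unissued_counts, key=unissued_counts.get)
    return ranked[:num_threads_to_fetch]

in_flight = {0: 12, 1: 3, 2: 25, 3: 7}   # unissued instrs per thread
print(icount_pick(in_flight))            # threads 1 and 3 fetch this cycle
```

A thread stalled on a long-latency miss accumulates unissued instructions, so this ranking automatically steers fetch bandwidth away from it.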

Results
– Throughput goes from 2.2 IPC (single-thread) to 3.9 (base SMT) to 5.4 (ICOUNT.2.8)

Reducing IQ-clog
– IQBUF: a buffer before the issue queue
– ITAG: pre-examine the tags to detect I-cache misses and avoid wasting fetch bandwidth
– OPT_last and SPEC_last: lower issue priority for speculative instrs
– These techniques entail overheads and yield only minor improvements

Bottleneck Analysis
– The following are not bottlenecks: issue bandwidth, issue queue size, memory throughput
– Doubling fetch bandwidth improves throughput by 8% – there is still room for improvement
– SMT is more tolerant of branch mispredicts: perfect prediction improves 1-thread by 25% and 8-thread by only 9%; disabling speculation shows a similar trend
– The register file can be a huge bottleneck

IPC vs. Threads vs. Registers

Power and Energy
– Energy is heavily influenced by "work done" and by execution time
   compared to a single-thread machine, SMT does not reduce "work done", but reduces execution time  reduced energy
– Same work, less time  higher power!
