EECE476 Lecture 28: Simultaneous Multithreading (aka HyperThreading) (ISCA’96 research paper “Exploiting Choice…Simultaneous Multithreading Processor” by Tullsen, Eggers, Emer, Levy, Lo, Stamm) The University of British Columbia, EECE 476, © 2005 Guy Lemieux

2 The Speed Limit?

Study                          ILP limit (instr/cycle)
Weiss and Smith [1984]         1.58  (≈1.6 instr/cycle)
Sohi and Vajapeyam [1987]      1.81
Tjaden and Flynn [1970]        1.86
Tjaden and Flynn [1973]        1.96
Uht [1986]                     2.00
Smith et al. [1989]            2.00
Jouppi and Wall [1988]         2.40
Johnson [1991]                 2.50
Acosta et al. [1986]           2.79
Wedig [1982]                   3.00
Butler et al. [1991]           5.8
Melvin and Patt [1991]         6
Wall [1991]                    7
Kuck et al. [1972]             8
Riseman and Foster [1972]      51
Nicolau and Fisher [1984]      90   (90 instrs in parallel!!!)

3 Barrier to Performance

Wide Superscalar
– Many Functional Units (ALUs, Ld/St Units, etc.)
– Lots of “potential” performance if all are busy
– Fact: they are often idle!
– Idle FUs → actual performance << potential performance

Cause of idle FUs?
– Waiting for memory results
– Out-of-order: fetch more instructions while waiting
– Limits on ILP → only 2 or 3 instructions available
– Reaching more load/store instructions → miss-under-miss → wait longer → more idle FUs

How to extract parallelism?
– Can try to explicitly write a parallel program
– Most languages are inherently sequential
– Humans break down complex tasks sequentially
– Difficult to write a “parallel program” that makes parallelism explicit

4 More Parallelism? Multithreading

Key observation
– Hard to get parallelism out of 1 program
– Latency: execution time of 1 program
– Difficult to improve latency! Doomed! Give up!

Concorde vs Boeing 747?
– Concorde: 2170 km/h → NYC to London in 2.6 hrs
– 747: 980 km/h → NYC to London in 5.7 hrs
– Concorde is faster, has lower latency
– BUT the 747 carries 3.5 times more people
– 747 throughput is higher: 1.6 times more people·km / hr
– Airlines prefer the 747 to the Concorde

Multithreading: carry more programs, improve throughput!
– Compute centres prefer CPUs with higher throughput
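The 1.6× figure follows directly from throughput = passengers × speed. A quick check, using only the numbers on this slide (the absolute passenger counts cancel out, so only the 3.5× capacity ratio is needed):

```python
# Throughput = passengers * speed. Only the ratio matters here, so the
# 3.5x capacity ratio from the slide is enough; no absolute counts needed.
concorde_speed = 2170   # km/h
b747_speed = 980        # km/h
capacity_ratio = 3.5    # the 747 carries 3.5x more people

ratio = capacity_ratio * b747_speed / concorde_speed
print(f"747 throughput / Concorde throughput = {ratio:.2f}")  # ~1.58, i.e. "1.6x"
```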

5 Which has Greater Performance?

Airplane            Passenger Capacity (ppl)   Range (km)   Speed (km/h)   Throughput (ppl*km/h)
Boeing 777          375                        7,450        980            367,500
Boeing 747          470                        6,680        980            460,600
BAC/Sud Concorde    132                        6,440        2,170          286,440
Douglas DC-8-50     146                        14,030       875            127,750

6 Multithreading: Basic Idea

Execute program 1
– FUs busy
– Cache miss
Switch to program 2
– FUs busy
– Cache miss
Switch to program 3
– Etc…
Switch to program 1
– Cache should now have data from memory
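A minimal sketch of this switch-on-miss (coarse-grained multithreading) policy. The thread objects, `step`, and `is_cache_miss` below are hypothetical stand-ins for hardware state and events, not anything from the paper:

```python
from collections import deque

def run_coarse_grained_mt(threads, is_cache_miss, step):
    """Run one thread until it stalls on a cache miss, then switch to
    the next ready thread so the miss latency is hidden by useful work."""
    ready = deque(threads)            # per-thread contexts (PC + registers)
    while ready:
        t = ready.popleft()
        while not t.done:
            step(t)                   # execute instructions from thread t
            if is_cache_miss(t):      # long-latency miss: park this thread
                ready.append(t)       # and run another one meanwhile
                break
        # a finished thread is simply dropped; a stalled one was re-queued
```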

7 Multithreading: Messy Details

Multithreading for a max of 4 programs (a context sketch follows below)
– 4 programs → 4 Program Counters and 4 Register Files
– Share the Data cache?
  • Programs are competing with each other
  • Program 1 may evict data necessary for the others
– Multiple Data caches?
  • Each one is smaller
  • If only running 1 program, it can’t use the whole cache
  • If 3 programs don’t need much cache, Program 1 can’t use the “unused” part
– Instruction cache: shared or multiple?
– Share the TLB? Bigger TLB?
  • Must prevent program 1 from accessing data in programs 2, 3, 4!!!
– How to switch between 4 programs fairly and quickly?
  • Ensure no single program hogs or starves the CPU
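To make the replicated-versus-shared split concrete, here is a hypothetical sketch of the state involved (structure and field names are illustrative, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:                  # REPLICATED: one per hardware thread
    pc: int = 0                                            # private Program Counter
    regs: list = field(default_factory=lambda: [0] * 32)   # private Register File

@dataclass
class SharedResources:                # SHARED: all threads compete here
    dcache: dict = field(default_factory=dict)   # evictions by one thread hurt others
    icache: dict = field(default_factory=dict)   # shared, or split into smaller ones?
    tlb: dict = field(default_factory=dict)      # entries must be tagged per thread
                                                 # so programs can't touch each other

contexts = [ThreadContext() for _ in range(4)]   # 4 programs -> 4 PCs + 4 reg files
shared = SharedResources()
```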

8 Multithreading Limits

Executing 1 program at a time
– Maybe switch programs after 20 to 100 instructions
Still extracting parallelism from only 1 program at a time
– Many FUs still idle
Need more parallelism!

9 Simultaneous MultiThreading (SMT)

Switch between programs more frequently
– Switch after 1 to 20 instructions!
Wide superscalar has many FUs (e.g., 10)
– Dispatch 10 instructions every clock cycle
Simultaneous Multithreading
– Allow issue of instructions from multiple programs every clock cycle
– This is the key difference from regular multithreading!
– Find small amounts of parallelism in multiple programs
– Combine their parallelism: keeps the FUs very busy!!!!
– Still many messy details…
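A toy sketch of that key difference: in a single cycle, the issue slots are filled from whichever threads have ready instructions, instead of from one running thread. The thread interface (`tid`, `ready_instructions`) is assumed for illustration:

```python
def smt_issue_one_cycle(threads, issue_width=8):
    """Fill this cycle's issue slots from ALL threads' ready instructions,
    rather than from a single running thread (ordinary multithreading)."""
    issued = []
    for t in threads:                        # interleave across programs
        for instr in t.ready_instructions():
            if len(issued) == issue_width:   # all issue slots full this cycle
                return issued
            issued.append((t.tid, instr))    # FUs stay busy even if each
    return issued                            # thread alone has little ILP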

10 The “Exploiting Choice…” Paper

Earlier version published at ISCA 1995
– Mostly “hypothetical” (idealistic), many assumptions
– ISCA = International Symposium on Computer Architecture
  • This is The Conference on Computer Architecture: best of the best!
This version published at ISCA 1996
– Many improvements over the 1995 paper: adds realism!
– Key question: is it practical to build a real CPU with SMT?
– Key contribution: yes, plus how to do it and why you want to!

11 SMT Study: Baseline SuperScalar CPU

Lots of caches
– 32KB Instr cache, 32KB Data cache
– 256KB combined I+D L2 cache
– 2MB L3 cache
– Lockup-free
Fetch (up to) 8 instructions per cycle
BranchPrediction + BranchTargetBuffer + SubroutineReturnStacks
– 256 entries in the BTB, 2K x 2-bit entries in the BranchPrediction Buffer
Register Renaming
– 32 more regs for renaming
Two InstructionQueues: 1 Floating-Point, 1 Integer
– 32 entries in each queue
– 3 Floating-Point Units
– 6 Integer Units: 4 can also do Loads/Stores
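For reference, the same configuration transcribed as a plain data structure (this mirrors the slide's parameters only, not the paper's full tables):

```python
# Baseline superscalar configuration, as listed on the slide above.
baseline_cpu = {
    "icache": "32KB", "dcache": "32KB",
    "l2": "256KB combined I+D", "l3": "2MB", "lockup_free": True,
    "fetch_width": 8,                        # up to 8 instructions per cycle
    "btb_entries": 256, "bpred": "2K x 2-bit counters",
    "rename_regs": 32,                       # extra registers for renaming
    "instr_queues": {"fp": 32, "int": 32},   # entries per queue
    "fus": {"fp": 3, "int": 6, "int_ldst_capable": 4},
}
```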

12 Baseline CPU + Multithreading [figure: the baseline datapath extended with multiple PCs, one per thread]

13 a) Baseline CPU Pipeline, b) Multithreaded Pipeline

14 CPU Performance on 1 Program

Use the Multiflow compiler (good choice!)
Baseline pipeline (does not support multithreading)
– Aka “unmodified superscalar”
– 1/CPI = 2.16 instructions per cycle
Longer pipeline (required to support multithreading)
– 1/CPI = 2.11 instructions per cycle
Not much harm from the pipeline changes
– The longer pipeline has a larger misprediction penalty
– Misprediction is rare, so not much increase in CPI
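The “misprediction is rare” argument can be sanity-checked with the standard CPI decomposition. The branch frequency, misprediction rate, and penalties below are illustrative assumptions, not numbers from the paper:

```python
# CPI = CPI_base + branch_freq * mispredict_rate * penalty_cycles
def effective_ipc(cpi_base, branch_freq, mispredict_rate, penalty):
    cpi = cpi_base + branch_freq * mispredict_rate * penalty
    return 1.0 / cpi

# Illustrative numbers only: 20% branches, 5% of them mispredicted.
short = effective_ipc(cpi_base=0.40, branch_freq=0.20, mispredict_rate=0.05, penalty=4)
long  = effective_ipc(cpi_base=0.40, branch_freq=0.20, mispredict_rate=0.05, penalty=6)
print(f"IPC, short pipe: {short:.2f}; longer pipe: {long:.2f}")
# A 2-cycle larger penalty costs only ~0.02 CPI here: a small IPC gap,
# the same flavor as the slide's 2.16 vs 2.11.
```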

15 CPU Performance on >1 Program

Modifications: extra PCs, extra registers, and a longer pipeline for SMT

16 Improved Parallelism!

Throughput dramatically improves
Can execute extra programs “almost for free”
– Some slowdown to the first program, but better throughput!
Example
– One program: CPI = 1/2.11 = 0.47
– Total execution time of 6 programs run sequentially: proportional to 6 / 2.11 = 2.84
– Six programs under SMT: CPI = 1/4 = 0.25
– Total execution time of 6 programs run simultaneously: proportional to 6 / 4 = 1.5
– Speedup with SMT = 2.84 / 1.5 = 1.9
– Nearly twice the throughput!!!
Consider SMT of 6 nearly-equal programs
– All programs start & finish around the same time
– Latency of each program = total runtime of all 6 programs
– Without SMT, latency of each program = 1/6 of the total runtime of all 6 programs
– If you need “partial results early”, lower latency is better (don’t use SMT)
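The arithmetic, spelled out (the aggregate IPC of 4 with 6 programs is the figure used on this slide; everything else follows from it):

```python
ipc_single = 2.11   # longer (SMT-capable) pipeline, one program at a time
ipc_smt = 4.0       # aggregate IPC with 6 programs running simultaneously
n_programs = 6

time_sequential = n_programs / ipc_single   # ~2.84 (arbitrary time units)
time_smt = n_programs / ipc_smt             # 1.50
print(f"throughput speedup = {time_sequential / time_smt:.2f}")  # ~1.9
```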

17 The ISCA’96 Paper Continues…

Explores:
– Best way to fill the InstrQueues
– Best way to keep the FUs busy
– Best way to fetch instructions from the cache
Unmodified SMT: speedup = 1.9
Combining the “best” techniques: speedup = 2.5
SMT is great for throughput!

18 Fetching Instructions from Cache

RR.1.8 is the “baseline”; RR.2.8 is “best”
– RR.1.8 fetches up to 8 instrs. from 1 program every cycle
– RR is “round robin”: the next cycle always fetches instructions from the next program (program i+1)
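A sketch of the round-robin scheme in code, under the reading the slide gives (RR.t.i = pick t threads round-robin, fetch up to i instructions total per cycle); the per-thread `fetch` helper is a hypothetical stand-in for the I-cache port:

```python
def rr_fetch(threads, cycle, num_threads=1, width=8):
    """RR.1.8 when num_threads=1: each cycle, advance round-robin to the
    next thread(s) and fetch up to `width` instructions total from them."""
    n = len(threads)
    picked = [threads[(cycle + k) % n] for k in range(num_threads)]
    fetched = []
    for t in picked:
        fetched += t.fetch(width - len(fetched))  # t.fetch(k) returns <= k instrs
    return fetched
```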

19 Choosing which Thread to Fetch (versus Round-Robin)

RR.1.8 is the “baseline”; ICOUNT.2.8 is “best”
ICOUNT: fetch from the programs with the fewest instructions in the InstrQueue (IQ)
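The ICOUNT heuristic, sketched under the same assumed interfaces as the round-robin sketch above: prefer threads occupying the fewest IQ slots, since a thread whose instructions pile up in the queue is making poor progress anyway:

```python
def icount_fetch(threads, num_threads=2, width=8):
    """ICOUNT.2.8: each cycle, fetch up to 8 instructions total from the
    2 threads with the fewest instructions waiting in the issue queues."""
    ranked = sorted(threads, key=lambda t: t.iq_count)  # fewest IQ entries first
    fetched = []
    for t in ranked[:num_threads]:
        fetched += t.fetch(width - len(fetched))
    return fetched
```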

20 Improving Fetch Efficiency (Longer IQ, Avoid Icache Misses)

BIGQ: a bigger IQ doesn’t help much.
ITAG: check the Instr cache tags 1 cycle ahead; on a cache miss, don’t try to fetch instructions for that program. Helps sometimes.
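A sketch of the ITAG idea with a hypothetical tag-probe interface: by checking the I-cache tags a cycle early, threads that would miss are dropped from the fetch candidates so their slots aren't wasted:

```python
def itag_filter(candidates, icache_tags_hit):
    """ITAG: probe the I-cache tags one cycle ahead and skip any thread
    whose next fetch would miss, freeing its fetch slots for others."""
    return [t for t in candidates if icache_tags_hit(t.pc)]
```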